If you polled 100 scientists at your next conference with the single question, “Is there publication bias in your field?” I would predict nearly 100% of respondents to reply “Yes.” How do they know? Did they need to read about a thorough investigation of many journals to come to that conclusion? No, they know because they have all experienced publication bias firsthand.

Until recently, researchers had scant opportunity to publish experiments that didn’t “work” (and most times they still can’t, but now at least they can share them online unpublished). Anyone who has tried to publish a result in which not all of their main findings were “significant,” or who has had a reviewer ask them to collect more subjects in order to lower their p-value (a big no-no), or who has neglected to submit to a conference when the results were null, or who has seen colleagues tweak and re-run experiments that failed to reach significance only to stop when one does, knows publication bias exists. They know that if they don’t have a uniformly “positive” result then it won’t be taken seriously. **The basic reality is this: If you do research in any serious capacity, you have experienced (and probably contributed to) publication bias in your field.**

Greg Francis thinks that we should be able to point out certain research topics or journals (that we already know to be biased toward positive results) and confirm that they are biased, using the Test of Excess Significance. This is a method developed by Ioannidis and Trikalinos (2007). The logic of the test is that of a traditional null-hypothesis test, and I’ll quote from Francis’s latest paper published in PLOS One (Francis et al., 2014):

We start by supposing proper data collection and analysis for each experiment along with full reporting of all experimental outcomes related to the theoretical ideas. Such suppositions are similar to the null hypothesis in standard hypothesis testing. We then identify the magnitude of the reported effects and estimate the probability of success for experiments like those reported. Finally, we compute a joint success probability Ptes, across the full set of experiments, which estimates the probability that experiments like the ones reported would produce outcomes at least as successful as those actually reported. … The Ptes value plays a role similar to the P value in standard hypothesis testing, with a small Ptes suggesting that the starting suppositions are not entirely correct and that, instead, there appears to be a problems with data collection, analysis, or publication of relevant findings. In essence, if Ptes is small, then the published findings … appear “too good to be true” (pg. 3).

So it is a basic null-hypothesis significance test. I personally don’t see the point of this test since we already know with certainty that the answer to the question, “Is there publication bias in this topic?” is unequivocally “Yes.” So every case that the test finds not to be biased is a false-negative. But as Daniel Lakens said, “anyone is free to try anything in science,” a sentiment with which I agree wholeheartedly. And I would be a real hypocrite if I thought Francis shouldn’t share his new favorite method even if it turns out it really doesn’t work very well. But if he is going to continue to apply this test and actually *name* authors who he thinks are engaging in specific questionable publishing practices, then he should *at the very least* include a “limitations of this method” section in every paper, wherein he *at least* cites his critics. He should also *at least* ask the original authors he is investigating for comments, since the original authors are the only ones who know the true state of their publication process. I am surprised that the reviewers and editor of this manuscript did not stop and ask themselves (or Francis), “It can’t be so cut and dried, can it?”

**Why the Test for Excess Significance does not work**

So on to the fun stuff. There are many reasons why this test cannot achieve its intended goals, and many reasons why we should take Francis’s claims with a grain of salt. This list is not at all arranged in order of importance, but in order of his critics listed in the JMP special issue (excluding Ioannidis and Gelman because of space and relevance concerns). I selected the points that I think most clearly highlight the poor validity of this testing procedure. This list gets long, so you can skip to the Conclusion (tl;dr) below for a summary.

*Vandekerckhove, Guan, Styrcula, 2013*

- Using Monte Carlo simulations, Vandekerckhove and colleagues show that when used to censor studies that seem too good to be true in a 100% publication biased environment, the test censors almost nothing and the pooled effect size estimates remain as biased as before correction.
- Francis uses a conservative cutoff of .10 when he declares that a set of studies suffers from systematic bias. Vandekerckhove and colleagues simulate how estimates of pooled effect size change if we make the test more conservative by using a cutoff of .80. This has the counter-intuitive effect of *increasing* the bias in the pooled effect size estimate. In the words of the authors, “Perversely, censoring all but the most consistent-seeming papers … causes *greater* bias in the effect size estimate” (Italics original).
- Bottom line: This test cannot be used to adjust pooled effect size estimates by accounting for publication bias.

*Simonsohn, 2013*

- Francis acknowledges that there can be times when the test returns a significant result when publication bias is small. Indeed, there is no way to distinguish between different amounts of publication bias by comparing different Ptes values (remember the rules of comparing p-values). Francis nevertheless argues that we should assume any significant Ptes result to indicate an important level of publication bias. Repeat after me: Statistically significant ≠ practically significant. The fact of the matter is, “the mere presence of publication bias *does not* imply it is consequential” and by extension “*does not* warrant fully ignoring the underlying data” (Italics original). Francis continues to ignore these facts. [As an aside: if he can come up with a way to quantify the amount of bias in an article (and not just state bias is present) then *maybe* the method could be taken seriously.]
- Francis’s critiques themselves suffer from publication bias, invalidating the reported Ptes-values. While Francis believes this is not relevant because he is critiquing unrelated studies, they are related enough to be written up and published together. While the original topics may indeed be unrelated, “The critiques by Francis, by contrast, are by the same author, published in the same year, conducting the same statistical test, to examine the exact same question.” Hardly unrelated, it would seem.
- If Francis can claim that his reported p-values are accurate because the underlying studies are unrelated, then so too can the original authors. Most reports with multiple studies test effects under different conditions or with different moderators. It goes both ways.

*Johnson, 2013 (pdf hosted with permission of the author)*

- Johnson begins by expressing how he feels being asked to comment on this method: “It is almost as if all parties involved are pretending that p-values reported in the psychological literature have some well-defined meaning and that our goal is to ferret out the few anomalies that have somehow misrepresented a type I error. Nothing, of course, could be farther from the truth.” The “truth is this: as normally reported, p-values and significance tests provide *the consumer of these statistics* absolutely no protection against rejecting “true” null hypotheses at less than any specified rate smaller than 1.0. P-values … only provide *the experimenter* with such a protection … if she behaves in a scientifically principled way” (Italics added). So Johnson rejects the premise that the test of excess significance is evaluating a meaningful question at all.
- This test uses a nominal alpha of .10, quite conservative for most classic statistical tests. Francis’s simulations show, however, that (when assumptions are met and under ideal conditions) the actual type I error rate is far, far lower than the nominal level. This introduces questions of interpretability: How do we interpret the alpha level under different (non-ideal) conditions if the nominal alpha is not informative? Could we adjust it to reflect its actual alpha level? Probably not.
- This test is not straightforward to implement, and one must be knowledgeable about the research question in the paper being investigated and which statistics are relevant to that question. Francis’s application to the Topolinski and Sparenberg (2012) article, for example, is fraught with possible researcher degrees of freedom regarding which test statistics he includes in his analysis.
- If researchers report multiple statistical tests based on the same underlying data, the assumption of independence is violated to an unknown degree, and the reported Ptes-values could range from barely altered at best, to completely invalidated at worst. Heterogeneity of statistical power for tests that are independent also invalidates the resulting Ptes-values, and his method has no way to account for power heterogeneity.
- There is no way to evaluate his sampling process, which is vital in evaluating any p-value (including Ptes). How did he come to analyze *this* paper, or *this* journal, or *this* research topic? How many did he look at before he decided to look at this particular one? Without this knowledge we cannot assess the validity of his reported Ptes-values.

*Morey, 2013*

- Bias is a property of a process, not any individual sample. To see this, Morey asks us to imagine that we ask people to generate “random” sequences of 0s and 1s. We *know* that humans are biased when they do this, and typically alternate 0 and 1 too often. Say we have the sequence 011101000. This shows 4 alternations, exactly as many as we would expect from a random process (50%, or 4/8). If we know a human generated this sequence, then regardless of the fact that it conforms perfectly to a random sequence, *it is still biased*. Humans are biased regardless of the sequence they produce. Publication processes are biased regardless of the bias level in studies they produce. Asking which journals or papers or topics show bias is asking the wrong question. We should ask if the *publication process* is biased, the answer to which we already *know* is “Yes.” We should focus on changing the process, not singling out papers/topics/journals that we already know come from a biased process.
- The test assumes a fixed sample size (as does almost every p-value), but most researchers run studies sequentially. Most sets of studies are a result of getting a particular result, tweaking the protocol, getting another result, and repeating until satisfied or out of money/time. We know that p-values are not valid when the sample size is not fixed in advance, and this holds for Francis’s Ptes all the same. It is probably not possible to adjust the test to account for the sequential nature of real world studies, although I would be interested to see a proof.
- The test equates violations of the binomial assumption with the presence of publication bias, which is just silly. Imagine we use the test in a scenario like above (sequential testing) where we know the assumption is violated but we know that all relevant experiments for this paper are published (say, we are the authors). We could reject the (irrelevant) null hypothesis when we can be sure that the study suffers from no publication bias. Further, ~~through simulation~~ Morey shows that when true power is .4 or less, “examining experiment sets of 5 or greater will *always* lead to a significant result [Ptes-value], even when there is no publication bias” (Italics original).
- Ptes suffers from all of the limitations of p-values, chief of which are that different p-values are not comparable and p is not an effect size (or a measure of evidence at all). Any criticisms of p-values and their interpretation (of which there are too many to list) apply to Ptes.
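
The counting in the sequence example above can be checked with a few lines of Python. This is only an illustrative sketch; `count_alternations` is a hypothetical helper name introduced here, not something taken from Morey's commentary.

```python
def count_alternations(bits: str) -> tuple[int, int]:
    """Return (number of alternations, number of adjacent pairs) for a 0/1 string."""
    # An alternation is any adjacent pair whose two symbols differ.
    flips = sum(a != b for a, b in zip(bits, bits[1:]))
    return flips, len(bits) - 1


flips, pairs = count_alternations("011101000")
print(f"{flips} alternations in {pairs} adjacent pairs ({flips / pairs:.0%})")
```

The sequence looks exactly like what a fair process would be expected to produce, which is the point of the example: the sample alone cannot reveal whether the process that generated it is biased.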

**Conclusions (tl;dr)**

The test of excess significance suffers from many problems, ranging from answering the wrong questions about bias, to untenable assumptions, to poor performance in correcting effect size estimates for bias, to challenges of interpreting significant Ptes-values. Francis published a rejoinder in which he tries to address these concerns, but I find his rebuttal lacking. Because of space constraints (this is super long already) I won’t list the points in his reply but I encourage you to read it if you are interested in this method. He disagrees with pretty much every point I’ve listed above, and often claims they are addressing the wrong questions. I contend that he falls into the same trap he warns others to avoid in his rejoinder, that is, “[the significance test can be] inappropriate because the data do not follow the assumptions of the analysis. … As many statisticians have emphasized, scientists need to look at their data and not just blindly apply significance tests.” I completely agree.

Edits: 12/7 correct mistake in Morey summary. 12/8 add links to reviewed commentaries.

References

Francis, G. (2013). Replication, statistical consistency, and publication bias. *Journal of Mathematical Psychology*, *57*(5), 153-169.

Francis, G. (2013). We should focus on the biases that matter: A reply to commentaries. *Journal of Mathematical Psychology*, *57*(5), 190-195.

Francis, G., Tanzman, J., & Matthews, W. J. (2014). Excess success for psychology articles in the journal *Science*. *PLoS ONE*, *9*(12), e114255. doi:10.1371/journal.pone.0114255

Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. *The American Statistician*, *60*(4), 328-331.

Ioannidis, J. P., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. *Clinical Trials*, *4*(3), 245-253.

Johnson, V. E. (2013). On biases in assessing replicability, statistical consistency and publication bias. *Journal of Mathematical Psychology*, *57*(5), 177-179.

Morey, R. D. (2013). The consistency test does not–and cannot–deliver what is advertised: A comment on Francis (2013). *Journal of Mathematical Psychology*, *57*(5), 180-183.

Simonsohn, U. (2013). It really just does not follow, comments on. *Journal of Mathematical Psychology*, *57*(5), 174-176.

Vandekerckhove, J., Guan, M., & Styrcula, S. A. (2013). The consistency test may be too weak to be useful: Its systematic application would not improve effect size estimation in meta-analyses. *Journal of Mathematical Psychology*, *57*(5), 170-173.

Awesome read!

Nice summary. I hope it gets widely read. Just a small note regarding my critiques: The results were all analytic (assuming, optimistically, that the true power is known), not through simulation.

Thanks for clarifying, I’ve edited the post.

Hi Alex,

Thanks for putting together these comments in a single posting. It makes it easier to have a discussion when everything is all together. As you noted, I addressed most of these concerns in my response to the critiques in the Journal of Mathematical Psychology special issue. I think it would have been more useful for you to explain why you felt my response was inadequate, otherwise these types of discussions just keep starting over from the same place. (Maybe such starting over is what is needed to fully evaluate new ideas.)

You nicely summarized the critiques of the Test for Excess Success (TES); in the following I try to address each point.

Your Introduction:

–You argue that the TES does not tell us anything that we did not already know because bias exists across the field. But knowing there is bias in the field does not equal knowing there is bias in a particular article. An indication of bias across the field does not give scientists much guidance about which effects/theories to believe. My analyses provide information about bias within an article, and thereby give direct guidance about whether to believe the effects/theories in that article.

Your description of how bias is introduced (e.g., a need to show uniformly successful outcomes in order to get published) may properly characterize the situation; but that description hardly mitigates the problem. Such biased experimental results are still inconsistent with the theoretical claims (by being “too good to be true”).

To put it another way, your view is more pessimistic than mine in that you think all studies are biased. Given that attitude, I wonder why you are not calling for retractions of essentially all psychology-related articles. Do you have some reason to believe the findings are simultaneously biased and valid? The burden of proof is on scientists to demonstrate that their data support their theory. Seemingly biased findings cannot do that without some additional arguments. (None of the articles I analyzed have provided such arguments.)

–Just a short comment about the relation between the TES and hypothesis testing. The basic logic of the TES is similar to hypothesis testing (rare events raise doubts about the validity of a null hypothesis), but there are important differences. In particular, the Ptes value is not a Type I error rate; instead it is an estimated replication rate. A lot of current discussions are citing replication as the best marker for the validity of experimental findings, so I think most people would agree that replication rate is an important measure of experimental validity.

–You seem concerned that I _named authors_ of specific work, but what else could I have done? The field identifies experiments and theories by authors. You also seemed upset that I did not contact the original authors, but I am investigating the articles, not the authors. The most I could have gotten by contact with the original authors is additional information about experimental data sets (e.g., a correlation for within-subject scores) that would have allowed for better estimates of the probability of success. Since my analysis was quite conservative, this additional information would have only led to lower Ptes values. I analyzed the articles that were written; and if the original authors have additional information (such as to explain that there are unreported experiments or that the theory did not predict the data but was derived from it), then they are welcome to comment at PLOS One.

–You feel that I should have included a section describing the limitations of the method. Nearly every TES paper I have written describes the limitations. Namely: the method is conservative such that even strongly biased experiment sets may not be detected (because the biased effect sizes lead to exaggerated power), the method cannot identify the nature of the apparent bias, and the appearance of bias does not indicate that the full set of experimental results or theory are necessarily (or completely) invalid. My job as an author is to help readers understand the issues, not to confuse them with irrelevant ideas; so I do not make reference to critiques of the method that I think are irrational or poorly thought out.

OK, on to the review of the specific points raised by the critics in the JMP issue.

–Vandekerckhove, Guan, Styrcula:

The TES was never intended to adjust pooled effect size estimates, although I can see that it implicitly does something like that. The poor performance of using TES as a filter (noted in points 1 and 2 by Vandekerckhove et al.) is true only when 100% of the findings have publication bias. (This is not counterintuitive; I think it is clear that if everything is biased, then a perfect filter for bias is going to leave you with an empty set. An imperfect filter is going to leave you with the biased findings that pass the filter.) My response in JMP included simulations (Figure 2 in my JMP article) that show using the TES as a filter does have benefits when publication bias is less than 100%.

–Simonsohn:

1) It is true that I do not have an effect size measure for bias. It is possible that the observed bias is small and that the effects being analyzed are real. On the other hand it is also possible that the observed bias is large or that the effects being analyzed are false. The burden of proof is on the original authors, and part of that burden is for their experimental results to not appear biased in favor of their theoretical idea. My advice is for readers of a critiqued article to be skeptical about the effects or theory.

In some cases there may be ways to salvage something from a seemingly biased article. The recently developed p-uniform pooling approach may apply in some cases, and in other cases there may be theory-motivated ways of dealing with bias. I have never ignored these possibilities (and p-uniform does not apply to the Science articles because of the inhomogeneity of effect sizes). Without such a method of dealing with bias, ignoring the data seems like a prudent decision (e.g., I recommend ignoring Bem’s data claiming precognition, Ptes=0.058).

2) “Having” publication bias is not the same thing as “suffering” from publication bias. Bias matters relative to the theoretical conclusions. If you tell me about five unrelated experiments that each produce p=0.049, then there is hardly a reason to complain even though the estimated power of each experiment is just around one-half. This set may or may not have publication bias, but it hardly matters because you are not drawing any conclusions from the experiments. On the other hand, if you tell me that five experiments, which each have p=0.049, support your theoretical claims, then I am going to be suspicious about the data or those claims because if the theory is correct then the probability of such experimental success is low (Ptes = 0.5^5 = 0.03125). I am going to suspect that you had other relevant experiments, that you did something improper in the sampling or analysis, or that your theory is partly fitting noise (HARKing).
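
As a rough check on the arithmetic in the paragraph above, the following sketch estimates the post hoc power implied by p = 0.049 and multiplies it across five experiments. It rests on assumptions that are not in the original comment: a two-sided z-test at alpha = .05, a true effect equal to the observed one, and independent experiments. `estimated_power` is a hypothetical helper name, not code from any TES implementation.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def estimated_power(p_value: float, alpha: float = 0.05) -> float:
    """Post hoc power of a two-sided z-test, taking the observed effect as the true effect."""
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)   # observed test statistic implied by p
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)    # critical value for significance
    # Probability that a replication lands beyond either critical value.
    return STD_NORMAL.cdf(z_obs - z_crit) + STD_NORMAL.cdf(-z_obs - z_crit)


power = estimated_power(0.049)      # close to one-half, since p sits just under alpha
ptes_estimated = power ** 5         # joint probability that all five succeed
ptes_rounded = 0.5 ** 5             # the rounded figure used in the text: 0.03125

print(f"estimated power per experiment: {power:.3f}")
print(f"joint success probability: {ptes_estimated:.4f} (rounded power gives {ptes_rounded})")
```

Under these assumptions the per-experiment power comes out just above one-half, so the product lands close to the 0.03125 quoted above.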

All of my descriptions of publication bias were (of course) performed by me, but there’s no theoretical claim related to my ability to ferret out publication bias. The fact that I do not publish every publication bias analysis means that you cannot use my one-off investigations to estimate the percentage of seemingly biased articles across the field. Don’t do it! This limitation does not apply to my investigations of Science and Psychological Science articles, where the 83 and 82 percentages are a valid estimate of the occurrence of bias.

3) It does indeed go “both ways”. For example, Whitson & Galinsky (2008) can remove the appearance of bias (Ptes = 0.008) by adding an addendum to their Science paper to clarify that their six experiments are unrelated to each other and do not actually support their theoretical claim that “Lacking control increases illusory pattern perception.” I doubt Science would accept such an addendum and would, instead, just retract the article.

–Johnson:

1) I am sympathetic to the concerns about frequentist hypothesis testing and p-values, but many (most) researchers in psychology continue to take p-values seriously, so I think it is worthwhile to critique articles that use p-values in that spirit.

2) I did look into adjusting the alpha level so that Ptes could be interpreted as a Type I error rate. Sometimes you can do this with Monte Carlo simulation methods, but there are tricky cases that do not have a clear solution (at least to me). As I have used it, the TES does not try to control the Type I error rate, but instead estimates replicability. At least as I have used it, this approach leads to a quite small Type I error rate. It’s a weakness of the test.

3) It’s not as difficult to apply the TES as it might appear. Articles are often quite clear about how the experimental results relate to theoretical claims. In the particular case critiqued by Johnson he ignored the experimental results that were clearly defined by the original authors (in the title, abstract, and text) as relevant to their theoretical claims. Of course it is possible for me to make a mistake when applying the TES, and if anyone finds an error in the Science article analyses, then I hope they will post a comment at PLOS One.

4) All of these concerns can be accommodated by the TES (often by being more conservative about concluding bias). The methods are described in my JMP target article, the Psych Science analyses, and the Science analyses. If you think I have made a mistake, please discuss it in detail. A vague statement like “there is no way to account for power heterogeneity” is hardly helpful (I’m not even sure what problem you are referring to here).

5) My sampling process within an article is transparent. You can read the original article, and I cannot generate more experiments or analyses than what are reported in the article. Since my conclusions are about the _article_ that is the only sampling process that matters for the bias conclusion. My motivations for selecting any specific article, journal, or research topic do not matter for that conclusion.

If you believe that those motivations really do matter, then you need to ask the same question for essentially all experimental findings in psychology; and you cannot interpret the p-values for a Stroop effect study unless you know why the researcher did not choose to investigate the effect of salinity on star-fish mating preferences. Good luck with that scientific approach.

–Morey:

1) If you really believe that the scientific (and publication) process is always biased, then why do you believe any of the claims made in any psychology-related article? I am nowhere near as pessimistic as you.

More generally, what matters is whether experimental findings appear biased relative to the theoretical claims. The TES approach estimates what would happen with replications of the reported studies. If the probability of success for those replications is low then scientists should doubt the reported findings and/or the theoretical claims.

2 & 3) Your descriptions of how experiment sets are generated may be true, but the generation process can produce biased outcomes. Describing a biased scientific method for generating experiment sets or theory hardly means that the TES is invalid. Suppose you generate an experiment set of six experiments (then money ran out), and you get four findings that reject the null (p=0.04, p=0.03, p=0.04, p=0.02) and two findings that do not reject the null (p=0.3, p=0.2). You interpret all of these findings as support for your theory (the non-significant findings identify boundary conditions for the effect). You publish everything so there is no file-drawer bias here. What is the probability that someone replicating your full set of experiments would find the same degree of success (for both the significant and non-significant findings)? If we assume each experiment is a two-sample t-test with 50 subjects in each group, then Ptes = 0.066. So replicators have an estimated 7% chance of achieving the same degree of success. That low replication rate does not engender much confidence in the findings or the theory.
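
The six-experiment example above can be approximated with a short sketch. It is not the procedure used for the published analyses: it swaps the two-sample t-test (50 subjects per group) for a normal approximation, takes each observed effect as the true effect, and treats the experiments as independent. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.05  # assumed two-sided significance criterion


def estimated_power(p_value: float, alpha: float = ALPHA) -> float:
    """Post hoc power of a two-sided z-test, taking the observed effect as the true effect."""
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(z_obs - z_crit) + STD_NORMAL.cdf(-z_obs - z_crit)


def joint_success_probability(p_values: list[float], alpha: float = ALPHA) -> float:
    """Probability that replications reproduce the same significant/non-significant pattern."""
    prob = 1.0
    for p in p_values:
        power = estimated_power(p, alpha)
        # "Success" means repeating the reported outcome: significant stays
        # significant, non-significant stays non-significant.
        prob *= power if p < alpha else 1 - power
    return prob


reported = [0.04, 0.03, 0.04, 0.02, 0.3, 0.2]
ptes = joint_success_probability(reported)
print(f"Ptes is approximately {ptes:.3f}")
```

With the normal approximation the result is close to, but not identical with, the 0.066 reported above; a t-based power calculation with 98 degrees of freedom would differ slightly.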

The problem here is how the theory was generated. In the above scenario you transparently HARKed, which means that your theoretical conclusions probably tracked noise that was inevitably present in your experimental results. In short, your theory “overfits” your data, and thereby does not generalize to replication studies. A different theoretical conclusion derived from the experimental data might not appear biased. For example, suppose you interpreted the two non-significant findings as “failures” for the theory. What is the probability of replicators observing 4 or more significant results out of 6 experiments like these? It’s Ptes = 0.24, which might be lower than you would like, but is not so low as to raise serious concerns about bias. Of course, you might have trouble convincing people that your theory is valid when you present 4 successful and 2 unsuccessful experiments; that’s just the way it is.
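
The "4 or more significant results out of 6" figure above is a tail probability over experiments with unequal power. A minimal sketch follows, again using a normal approximation in place of the t-test, taking observed effects as true effects, and assuming independence; `prob_at_least` is a hypothetical helper name.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def estimated_power(p_value: float, alpha: float = 0.05) -> float:
    """Post hoc power of a two-sided z-test, taking the observed effect as the true effect."""
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(z_obs - z_crit) + STD_NORMAL.cdf(-z_obs - z_crit)


def prob_at_least(k: int, success_probs: list[float]) -> float:
    """P(at least k successes) for independent trials with differing success probabilities."""
    dist = [1.0]  # dist[j] = probability of exactly j successes so far
    for p in success_probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1 - p)   # this trial fails
            nxt[j + 1] += mass * p     # this trial succeeds
        dist = nxt
    return sum(dist[k:])


powers = [estimated_power(p) for p in [0.04, 0.03, 0.04, 0.02, 0.3, 0.2]]
ptes_4_of_6 = prob_at_least(4, powers)
print(f"P(at least 4 of 6 significant) is approximately {ptes_4_of_6:.2f}")
```

Under these assumptions the tail probability comes out in the neighborhood of the 0.24 quoted above, far higher than the probability of reproducing the exact pattern of four successes and two specific non-significant results.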

The discussion about fixed samples seems irrelevant. Clearly the number of samples (in this case experimental results) is determined by the original authors. Thus, it is effectively fixed for my TES analysis.

Morey’s example simply points out that the TES will sometimes make Type I errors (conclude bias when it does not exist). In my JMP response I demonstrated that such errors will be very rare for the case Morey presents. The only way to avoid Type I errors is to not make decisions. You do not have to make a decision about bias; but if you believe that an effect or theory addresses an important topic, then I think it is foolish to ignore information about the relationship between the data and the theory.

4) In my view, the main problem with p-values is that they do not work well when accumulating evidence across experiments. This is a serious limitation because much of our scientific theorizing depends on this type of evidence accumulation. Fortunately, the TES (as I have used it) does not derive a theory by accumulating evidence across experiments. It simply is not true that “any criticisms of p-values” apply to Ptes. If you have specific concerns, please describe them.

I understand that the ideas of TES run contrary to the intuitions that many research psychologists have about probability, experimental power, and replication. Given the misunderstandings of hypothesis testing and replication, I am not surprised that there is also confusion about the TES. We need to work together to better understand how to derive theoretical ideas from noisy data. The TES is only the start of that process.

Best wishes,

Greg Francis

Thanks for the comment! I very much appreciate your reply. I would have liked to go more in depth looking at your rejoinder, it was just unfortunate that the post was already much longer than I typically like to write. This comment is pretty long as well.

Just to start, I can see where you are coming from in not citing the critics of this method when you think they are irrational or wrong. I mischaracterized you if I implied you don’t describe limitations at all. But I think the reader should be made aware that there is not a consensus about the use of this method. The readers of your latest article investigating Science’s excess success rate almost certainly aren’t regular readers of JMP, and so would be under the impression that this method only has the limitations recognized by you, namely being overly conservative and unable to identify the exact sources of bias. From the responses I’ve gotten discussing this method with others it would seem the jury is still out, and I think readers should be made aware of that fact.

To address your questions to me of “do I think all results are biased,” and further “do I think they should all be retracted?”: Yes to the first, no to the second. Here is where I think we differ: I think there are important degrees of relevance to this bias (if we grant that it can be attributed to individual articles), and lumping everything in with Bem’s psi article would be a mistake. To answer your other question directly: Yes, I think a study can be biased while still being valid. I can increase my belief in a theory put forth in an article even if I know with certainty that the effect size estimate is likely to be inflated or exaggerated, much the same way I can believe my friends had a good time at one of their parties even though I know they exaggerate their stories. I may not know _exactly_ how much to believe it but I can still have some positive belief in it. There are degrees to my belief. If a method could characterize _how much_ an effect is overestimated, much like p-curve, then I would certainly be more receptive.

It is important (in my opinion) that if one is going to try to identify exaggerations in effect sizes, one should try to answer the right question. That question is, “How much is this effect being overestimated?” (which TES cannot do) or if you like, “How much should I believe this effect (or how much should I update my beliefs about this theory)?” (which TES cannot do) and not, “Is this effect possibly being overestimated?” (which is what TES tries to do, but I still hold does not). On a side note, introducing bias can lead to underestimates of effect size too, no?

This is the same reason that classical hypothesis testing using p-values makes little sense to me. I think one of the biggest issues I have with TES is much the same as I have with other long-run frequency statistics. Sample spaces are poorly defined, adjustments for multiple comparisons must be made even with implicit multiple comparisons, and there is the logical fact that the (un)likelihood of results under a theory cannot be used alone to quantify support for/against a theory. The fact that something is rare under a single hypothesis should not (and really cannot) lead to a belief that that particular hypothesis is likely wrong in this framework. You say this method is not a hypothesis test because it is estimating the number of studies that can be expected to be as successful as (or more successful than) the reported ones. That is EXACTLY what classical hypothesis testing provides. To say TES is not a hypothesis test is misguided.

You say you need not ask for author comments because the most information you could get from authors is akin to within-subject correlations. Could it be possible that they could inform you that the results reported actually include all of the experiments and analyses run, and that they could prove it somehow? I don’t know if they could, but I think that by naming authors that you think are showing bias, and by suggesting we should ignore their results, you should offer them a chance to comment since they are the ones who know what actually happened.

If I can make time I will try to address all of your concerns with the other authors’ critiques, but alas it’s not possible for me at the moment so I’ll post what I have so far above. (And I’m sure they can defend themselves if they get impatient with me).

Hi Alex,

Regarding citing the critics. I feel that my JMP response appropriately addressed the arguments raised by the JMP critics. I have not seen any new criticisms raised against the TES, so I am not sure what I was supposed to write about. Of course, new arguments are welcome, and so are arguments against my response.

I don’t think we are so far apart on interpretations of bias and validity (although I think your impression that all results are biased is unjustified). I readily concede that experimental findings/theories may be biased but still have validity. Especially for cases where the experiment set consists of direct replications, something like p-uniform can demonstrate both bias and validity (along with a new estimated effect size). That’s great, and it should be applied in such cases. Unfortunately, it does not apply to most of the Science articles because they reported experiments of different effects as a kind of converging evidence for the theoretical claims. That does not mean there is nothing to be done here, but I suspect that a re-interpretation of the experiments will require a subject matter expert. (Although sometimes there may be nothing that can be done.)

You are right that if one is trying to identify exaggerations in effect size, then you should try to answer that question. But that’s not what the TES is trying to do. Fundamentally, the TES is checking whether the theoretical claims appear consistent with the reported experimental data. (This aspect of the test is why I called it the “consistency test” in my JMP article; but nobody liked that term.) When the experimental data and theoretical claims seem inconsistent, then we can speculate about the reasons why: publication bias, QRPs, HARKing; and those reasons do often introduce an overestimate of effect sizes.

I share many of your concerns about classical hypothesis testing, but many of those fears do not apply to the TES (at least as I have used it). The sample space is well defined and there are no multiple comparisons because everything is defined by the original authors.

I also agree that a low probability of a single hypothesis by itself does not prove that hypothesis is false; we need to consider the alternative hypothesis. In the case of the TES, the single (null) hypothesis is that relevant experiments are fully reported, the analyses are proper, and the sampling is proper. From that hypothesis we estimate the probability of the observed (or better) level of success: Ptes. I think it is obvious that the probability of the observed (or better) data is essentially 1.0 for the alternative hypothesis: If researchers are only reporting successful experiments, then of course the probability of reporting successful experiments is 1.0. I actually considered developing a Bayesian version of the TES that compares these probabilities, but when I ran the idea by a few Bayesian colleagues they were skeptical (mainly because it uses the frequentist approach of integrating under a pdf to get the probabilities).

I don’t see what the original authors could say to remove the appearance of bias in their articles. Even if they tell me that they did not have any unsuccessful experiments, their results still seem rather unbelievable. As we argued in the PLOS One article, bias can sneak in via a lot of methods that appear to be good science, but maybe some of these authors were just really (un)lucky to have so much success. Even so, we scientists can only draw conclusions from the data in front of us. If the experimental data are rare given the theory, then that inconsistency should make us skeptical.

I hope you find time to prepare a more detailed description of your views. I look forward to continuing the discussion.

Best wishes,

Greg

Hi Greg, thanks for continuing to comment here. I’ll try to tackle some of your replies to the JMP comments, using numbers to match your own:

Simonsohn:

1. I think it is interesting that you treat the presence of bias in an article as a pure yes/no decision, given that you acknowledge the .10 cutoff is arbitrary. Like typical p-value decisions around arbitrary cutoffs, surely binning results into completely biased vs. unbiased is oversimplistic? I also find it odd that you say you are less pessimistic than I am but then claim that any presence of bias necessitates we reject the evidential value of any study. Perhaps we are pessimistic in different ways. (For the record, I agree we should ignore Bem’s data but not because it failed this test).

We may now have a method of dealing with systematic overestimates of effect size. Have you seen the latest p-curve paper? It performs incredibly well at adjusting effect sizes. I was not entirely sold on the method, but I think so far it shows a lot of promise. In fact, I think that p-curve does everything that the test of excess significance tries to do but does it all better.

2. Wouldn’t you agree that a method that can actually tell us what kind of evidential value a set of studies contains is better than a method that simply presents studies and claims they are biased and invalid?

Johnson:

1. I personally critique p-values and articles that use them all the time; it is something I feel obligated to do. So I’m with you there. But it is one of the reasons why I don’t buy the utility of Ptes. The logic behind p-values and most other long-run frequency statistics is just not sound.

2. I am skeptical of the logic of this test if there is not an appropriate alpha level. In your articles you write about sets of studies that lead to a rejection of the null hypothesis of essentially perfect experimental execution. If the test does not have an alpha level then you are not rejecting the null in any principled way. Null hypothesis tests are already borderline unprincipled when they use the traditional logic of the tests.

3. I am confident that you’ve used the test correctly.

4. Sorry that wasn’t clear. The main point to get across was that the binomial test assumes independence of test statistics, and unless we know how much different experiments are related then we can’t be sure the results we get are valid.

5. Yes, I believe what you say about your sampling plan. But it is just a fact of significance tests that the intentions of the ones running the test matter. I understand that you are not concerned with type I errors in any regard for this test, but unfortunately I can’t seem to get away from it. If you tell me that you sample a certain way I will believe you, but others using this method may not be as predictable. Since this method is a hypothesis test it matters what relevant comparisons you make. If a researcher has an agenda to discredit certain topics (I am NOT meaning to imply that this is you at all. You seem like a perfectly stand-up guy) then their selection of studies to test will affect the alpha rate. Just as the data generation process affects alpha, so too do researcher intentions. Since power (genuine or post-hoc) is dependent on alpha, we cannot be sure that the test is returning accurate results. This is an unfortunate side-effect of using long-run frequency statistics.

Morey:

He can defend his criticisms himself, it would seem. I will say that the apparent pessimism is only so extreme if you take a stance that all biased studies are of no value at all. I do not take that stance and so I don’t think I’m as pessimistic as you make me seem.

4. P-values alone are not just worthless for accumulating evidence across experiments; they cannot be used as evidence in any case. Ptes is no exception.

I don’t think that Ptes is _that_ counterintuitive, for the record. I just don’t think it does what you want it to do because it is in the same boat as regular p-values.

Thanks again for sticking around to hash all of this out.

Thanks for the feedback. My replies follow your numbering system.

Simonsohn:

1a. Like you, I am not a fan of the pure yes/no decision that is involved in classical hypothesis testing. Nevertheless, the audience of scientists I am trying to reach _does_ believe in such decisions. To convince them that there is a problem in how they relate data and theory, I have to present the issues in a way they will understand. There are other ways to do this. You can do a Bayesian analysis and show that some significant p-values correspond to evidence for the null (Wetzels et al., 2011). You can show that for commonly used sample sizes, a random model provides the best average fit to the data (Davis-Stober & Dana, 2014). These are not exactly the same as the TES, but they touch on similar issues. These alternative approaches are valuable, but I am skeptical that they are reaching a broad audience (although I hope they do) because most research psychologists cannot understand the reasoning.

What psychologists do (imperfectly) understand is that they want their results to replicate. That’s how they become convinced that their findings and theories are valid. Ptes provides an estimate of that replication rate for the set of experiments that were used to support the theory. The 0.1 criterion is _way_ below what most research psychologists would interpret as being an acceptable replication rate.

1b. Things like p-uniform and p-curve are great, provided the assumptions are satisfied, but they do not apply to every situation where the TES can be applied. For example, in the PLOS One paper, Table 1 describes the calculations of success probabilities for complex designs (multiple outcomes have to be satisfied for the experiment to be judged a success). The p-uniform and p-curve approaches can only take a p-value from one test. Maybe the p- approaches can be extended; I am not sure.

2. Yes. (I am a little surprised you like the p-curve approach given that it depends on traditional hypothesis tests.)

Johnson:

1. I can respect that attitude. Of course, it means you were already skeptical of all the Science articles to begin with and did not need the TES analysis to convince you that there is a problem. Most research psychologists do not share your attitude, and the TES analysis uses their own methods to highlight the problems. (I agree that there are problems with p-values, but I do not feel they are fundamentally flawed in _every_ situation.)

2. Yeah, it bothers me too. I only use the TES in conservative situations, so I think I am being fair. I do agree it could cause problems in other situations.

3. Others might disagree. I hope they check my work.

4. Yes, the way I calculate success probabilities takes this into account.

5. It’s funny that you simultaneously dislike traditional hypothesis testing but also worry about Type I error. Bayesian and validation methods typically do not worry about Type I error (it’s just not useful). I am pretty sure that my sampling approach has been fair, because it has been defined by the original authors. I do agree that there is room for abuse in other situations. For example, if someone uses the TES to analyse experimental support across multiple articles that are related to a theory, they might proceed chronologically and stop as soon as Ptes goes below 0.1. That would be optional stopping for the TES, and it would be problematic.

Morey:

Regarding pessimism, I guess I am curious as to why you remain optimistic even when you see that many of the articles are biased. I’m not saying there are no good answers to the question. My own feeling (and it really is just a feeling) is that research psychologists want to do good science, but they have been taught poor methods for doing it. I have both optimism and pessimism.

4. As “evidence” in a formal sense, I agree; but I don’t think that makes p-values worthless. For a field that cares about replication, Ptes is related to something researchers care about. I look forward to a day when we trade in p-values and Ptes for more rational approaches to theory development and testing.

I’m ditching the numbers now. Some general thoughts:

I guess we’ll just have to agree to disagree about p-values and their worth. But I hope you understand that many of my criticisms of Ptes are coming from that perspective.

I am also surprised by my interest in p-curve. While I’m still forming an opinion on it, the performance in simulations to adjust effect sizes is hard to ignore. If psychology is going to keep using long-run frequency stats (looks like it) then I do think we should do our best to focus on estimating reasonable effect sizes. If p-curve’s only use was to say if a set of results had any value at all based on a classic hypothesis test, I wouldn’t be interested.

That’s not to say I think we should abandon hypothesis testing altogether and only focus on effect sizes (I’m not as radical as Geoff Cumming). But we should be using tools that make sense for that purpose.

With regard to type I errors: I don’t think it’s too strange to worry about type I errors if they are a part of the logic of the particular statistical framework being used. Do I think type I errors are problematic? It’s true I don’t really care about them when I’m using Bayesian tools. But in my opinion they do matter in the framework where making a type I error has a relevant meaning (I’m not as radical as Andrew Gelman in this respect).

Re pessimism: It’s a great question, how to remain optimistic about a study in the face of systematic bias. The best answer I can give is a rehash of what I said before. I don’t think the mere presence of bias means a set of findings can’t be taken seriously. Do I think they should be scrutinized? Yes, the same way we should scrutinize all reports. If that view is contrary to how most psychologists feel then I guess I’m an oddball (and a bit disappointed).

Greg, why would you, in your recent PLoS article, cite your response to the critiques without labeling it as a response to the critiques (instead incorrectly implying that it is a “[r]ecent investigation[] using the TES”) and not cite any of the critiques at all? You cited your own bookends of the conversation in JMP, without citing any of the critics. This seems misleading to me.

I sent this email to Greg a year ago. I figure since Greg is posting here, he can post his emailed response to me.

###### Begin email

I recently read your response to the comments on your consistency test paper, for the special issue. I was a bit nonplussed when I read your reply to my comments, and I was hoping to engage you further on the issue. If you perhaps think that a fleshing out of these issues would be interesting to others, and best public, there are several people who have blogs and might host a conversation. But to get to the issues…

In particular, your response to my second point seemed to miss the main point. I noted that your model of the publication process was wrong, and hence the number of expected significant studies in any group is wrong too. I gave an example. But instead of engaging with the main point, you engage the example. There are a number of issues with your response to the example.

First, you write, “As he notes, ‘If the true power is known to be 0.4 or less, then examining experiment sets of 5 or greater will always lead to a significant result, even when there is no publication bias’. The final part of the statement is incorrect.”

The “final part” refers, presumably, to “even when there is no publication bias.” The example explicitly included *all* experiments being published, and thus, by definition, there is no publication bias. Publication bias is the tendency for some papers to get published and others not; at least, that’s the way the term is used in my experience. I don’t get how a statement that is true by definition can be called “incorrect.” Perhaps you could point me to the source that defines publication bias in a way that makes it so.

But you note a different sort of bias: that subsets of studies – namely, those that come from long chains of significant studies, followed by a significant one – will yield biased effect sizes. This is true, but only for groups of studies. Because the marginal probability of a significant effect remains the same under a binomial setting or the geometric setting that I describe, the expected effect size across all studies remains unbiased. By removing the studies from long chains of significant studies, and not removing some of the nonsignificant ones, you are introducing downward bias in the effect size.

An analogous situation would be if everyone published every study separately, but all studies were definitely published. The exact argument you made could be made for eliminating all significant effects. It is trivial to prove that significant studies will yield a biased effect size. But across *all* studies, the effect size will be unbiased. Hence, removing the *known biased* significant effects will yield a biased effect size across all studies. The fact is that QANSR will be unbiased, overall. That you can find subsets of studies that will be biased is true under any strategy, and uninteresting.

The main point here is that you can’t simply remove groups of studies, even if you know they are biased as a group, unless you also know how to remove other studies (say, some of the non-significant ones) to balance out the removal of the ones that are biased high. Knowing what to remove to keep things unbiased requires knowing what the process is that generated the studies, and that is exactly what we don’t know.

With this in mind, the consistency test simply becomes a way to introduce some unknown amount of downward bias into the effect sizes across the literature. How much it is counteracting another upward bias, one doesn’t know.

Finally, you say, “If a scientist practices QANSR but does not inform readers about that strategy, then readers have a false sense about the replicability of the experimental findings. As long as the scientist is up front about the process, then perhaps there is little harm to such a misrepresentation.”

Are you suggesting here that all research must conform to the assumptions of the consistency test, otherwise it is a “misrepresentation”? I know of no researcher who plans to do N studies in advance, then does them all (except maybe in very specific biomedical research). And if you asked them, no researcher would tell you that they did. I don’t see a “misrepresentation” here. Perhaps you could clarify this statement; on its face, it looks like you simply have a Procrustean objection to strategies that aren’t the ones you anticipated.

##### End email
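As an aside for readers following the email's claim that significant studies alone give a biased effect size while the full set does not, here is a minimal sketch. It assumes a normally distributed effect estimate with known standard error and counts a study as significant only when it clears the two-sided 5% cutoff in the predicted direction. The function name and the values `true_d = 0.3` and `se = 0.2` are hypothetical, chosen only for illustration; they do not come from any study discussed here.

```python
from statistics import NormalDist

def mean_significant_estimate(true_d, se, alpha=0.05):
    """Expected effect estimate among significant studies only.

    Assumes d_hat ~ Normal(true_d, se) and treats a study as significant
    when d_hat / se exceeds the two-sided critical value in the predicted
    direction (significance in the wrong direction is ignored).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    a = z_crit - true_d / se            # standardized significance cutoff
    power = 1 - std_normal.cdf(a)       # chance a single study is significant
    # Mean of a normal distribution truncated from below at the cutoff.
    conditional_mean = true_d + se * std_normal.pdf(a) / power
    return conditional_mean, power

# Hypothetical illustration values (not taken from the thread).
true_d, se = 0.3, 0.2
conditional_mean, power = mean_significant_estimate(true_d, se)
print(f"power = {power:.3f}")
print(f"mean estimate over all studies = {true_d:.3f}")
print(f"mean estimate over significant studies = {conditional_mean:.3f}")
```

Under these assumptions the average over all studies equals the true effect by construction, while the average over the significant subset sits well above it. How far above depends entirely on the assumed power.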

I want to clarify a point above (and in my article). My example with sequential testing was *not* meant to suggest that that *particular* process was going on. Francis’ focus on QANSR was missing the point. What I was showing was that if you didn’t know the process by which the individual studies were generated, you could be misled to an astounding degree. I don’t know the process by which any given scientist generates and groups studies. There are probably many. But Francis assumes they can be described by a binomial. This is just crazy; it is not a reasonable model of research. Without knowing the process, you can’t generate a significance test, because you can’t choose a good null hypothesis. And the null hypothesis in the TES is unquestionably bad.

As Richard noted, we had a brief email conversation a year ago about these issues. Unfortunately, we never followed up beyond these two emails (we both thought it might be easier to discuss things face-to-face, but we never arranged a meeting). By the way, the term “consistency test” is what I used in my JMP article to refer to the TES. I have since changed to TES to reflect the wishes of Ioannidis, who developed the TES. Here is my reply to Richard’s initial email (typos intact):

######

Thanks for the feedback, and I would like to mention that I thought your commentary was the most interesting of the lot. It forced me to think about new issues.

A public discussion might be interesting, but I am traveling for next few weeks, so it might be tricky to coordinate. I am moving to spend a sabbatical year in Lausanne, Switzerland. Although not close to Groningen, it might be close enough to warrant a visit if we wanted to discuss things face-to-face.

To the issues…

The term “publication bias” means slightly different things to different people. For lack of a better phrase, I have been using it as kind of a catch-all term for biased results (e.g., a file drawer, optional stopping, HARKing). I keep searching for an alternative, and I am sorry if this is causing confusion.

I focused on your example for point 2 because I thought it was a good example. I do not think you disagree that the 5-experiment QANSR set is usually biased (it overestimates the effect size). We also agree that the consistency test identifies the presence of bias (in this case when the true power of .4 is known).

At least as far as I understand it, your counterargument below is using a different set. You now suggest that in the larger set of QANSR studies (e.g, those stopping after 1, 2, 3, 4, or 5 studies) is not biased. Well, that is true, but it’s hardly a QANSR case anymore. It’s just a large set of studies that were all reported. Indeed, given the .4 power that was hypothesized in the example, around 60% of the studies are going to fail to reject the null. The consistency test is probably not going to indicate bias for this large set.

I focused on the set of 5 because that is where you applied the consistency test. A key issue is what motivates a scientist to present a given set of 5 studies as being meaningful or supportive of a scientific claim/theory. I thought it was implicit in your description that a set of 5 experiments generated with the QANSR process were intended to support some scientific claim. If they were not intended to support some kind of claim, then I see no reason to group them together at all; and thus no reason to apply the consistency analysis. (The analysis can be applied, it just does not give a very meaningful conclusion.)

I think this is a really important point. You noted below, “That you can find subsets of studies that will be biased is true under any strategy, and uninteresting.” I fully agree with this statement. This is why _I_ do not identify the meaningful subsets of studies for the consistency analysis. Subject matter experts identify what they believe to be meaningful subsets of studies. For example, Topolinski & Sparenberg (2012) identified four experiments that they believed supported their theoretical claims. The consistency analysis casts doubt on the appropriateness of those experiments, as related to the theoretical claims.

I think that, in a sense, the consistency test is a form of model checking, which is advocated by Gelman (and others). Whether they intended to or not, Topolinski & Sparenberg (2012) generated a model (the one consistent with the ANOVA and t-tests they ran for their analyses). The consistency test is checking whether the data are consistent with that model.

For the final two paragraphs, I think it is a misrepresentation of the replication rate to use a QANSR method without informing readers, who will suppose that the experiments rate of rejection reflects the replication rate. I think many people really do make that supposition, given that they talk about replication as being the gold standard for empirical work. If everyone used a QANSR method, replication would not provide the confidence people imagine it provides. If people are using methods that inflate the estimated rate of rejecting the null, then they are tossing out the role of replication in science. I’m not entirely opposed to such an approach, especially if researchers used Bayesian methods that take the investigative process into account, but I am not quite sure how that would be handled.

By the way, researchers do claim to run particular studies and report them all. Piff et al. responded to my critique with, “Selective reporting was definitely not at play: we conducted seven studies and reported our findings for each study.”

As for removing groups of studies, I agree that if there really is no bias, then such removal is going to cause problems. My suspicion (and this is partly validated by the simulations I reported in response to the Vandekerckhove et al. commentary) is that the problem is generally small. If there is bias then I think such removal is going to help rather than hurt.

So, I think the main point that your comment is overlooking is that the subset of experiments that are subject to the consistency test are not generated by me, but by authors who are trying to use the set to make a scientific claim. Maybe the same set of experiments could be used to support a different claim, or maybe some of the experiments could be combined with other experiments to support still other claims. To me, that is the interesting part of the scientific process, and it requires much more work than most researchers are doing: data collection, model construction, model testing, model prediction, data collection to test predictions.

I look forward to your comments

#### End of email

Regarding a misrepresentation of the JMP response to the critiques as an application of TES, I can see why you come to that interpretation. The text in the PLOS One article is, “Recent investigations using the TES [12]–[21] have indicated…” (the JMP response is [19]). This was an oversight on my part. In my mind, the references were listing studies _about_ the TES (including its use) but the text implies something else. I am sorry that the text was not more clear, but I think this is a minor issue, especially since the JMP response _does_ discuss a specific use of the TES in response to the Johnson critique.

As for not citing the critics, I meant what I wrote to Alex (above). I thought my JMP response was sufficient and that there was no point in confusing readers with already rebutted criticisms. Of course, you and others are welcome to present new arguments against my rebuttal or against the TES. All I have seen since the JMP response are vague statements (Twitter and elsewhere) from you and others about the TES being a “bad test”, but that’s not really a start for a new conversation.

Regarding your final paragraph, I think either you do not understand what is being investigated by the TES, or I do not understand your complaint about studies being described by a binomial. Are you suggesting that when a theory says that there is a difference between two groups with an estimated effect size of d=0.45, that we cannot conclude that the estimated probability of success (rejecting the null) for 5 independent experiments with n1=n2=50 is the experimental power multiplied five times (0.606^5 = 0.0815)?
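For readers who want to check arithmetic of this kind, here is a rough sketch of the calculation. It uses a normal approximation to the power of a two-sided, two-sample test (an assumption made here for simplicity; it gives a power slightly above the 0.606 quoted above, which presumably comes from an exact t-based calculation). The function names are illustrative only.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power for a two-sided, two-sample comparison."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5   # approximate noncentrality
    # Probability of rejecting in either tail.
    return std_normal.cdf(ncp - z_crit) + std_normal.cdf(-ncp - z_crit)

def joint_success(powers):
    """Probability that every independent experiment rejects the null."""
    result = 1.0
    for p in powers:
        result *= p
    return result

power = approx_power(0.45, 50)
p_tes_approx = joint_success([power] * 5)      # from the approximation
p_tes_reported = joint_success([0.606] * 5)    # from the power quoted above
print(f"approximate power = {power:.3f}")
print(f"joint success (approximate power) = {p_tes_approx:.4f}")
print(f"joint success (power 0.606) = {p_tes_reported:.4f}")
```

Either way the joint probability falls below the 0.1 criterion, and the calculation rests on the independence and fixed-number-of-experiments assumptions that are disputed in this thread.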

By binomial I mean that you assume that a binomial setting (i.e., fixed N, same probability of “success”, independence) is reasonable. I’m not sure what you mean by “vague” statements that the model is bad; my comment made clear what that means (maybe you meant that other people have been vague?).

If you assume a binomial setting, and use a binomial probability model for the significant responses (which Ioannidis and you both do), then the p value will be computed as the probability of seeing Y or more significant results using the binomial probability mass function. If the setting is *not* binomial, then the probability of seeing Y or more significant results is not what you think it is (it can be particularly bad, as I mentioned in my comment, if you restrict N by only looking at “long-ish” chains of results). The summation over the probabilities of possible data will be wrong. If the probability of a more extreme result under the null is not what you think it is, then your p value is meaningless. This is well-known (see Berger and Berry, 1988, who note that this makes p values depend on likely unknowable information, a critique to which the TES is vulnerable).
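To make the binomial calculation concrete, here is a minimal sketch of the tail probability described above: the chance of Y or more significant results out of a fixed N, with a common power and independent studies. The function name is made up for illustration, and the power of 0.4 is the value from the example under discussion.

```python
from math import comb

def binomial_tail(y, n, p):
    """P(at least y successes in n independent trials with success probability p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(y, n + 1))

# Fixed N = 5 studies, each with power 0.4.
p_five_of_five = binomial_tail(5, 5, 0.4)
p_four_or_more = binomial_tail(4, 5, 0.4)
print(f"P(5 of 5 significant) = {p_five_of_five:.5f}")
print(f"P(4 or more of 5 significant) = {p_four_or_more:.5f}")
```

These numbers are only meaningful if N really was fixed in advance; if it was not, the sum runs over the wrong set of possible outcomes.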

I don’t know a researcher who would say that they fix N before doing a set of experiments, besides those who pre-register sets of experiments.

As to the question of whether it is relevant that *you* were not the ones grouping the studies, this is irrelevant. You are the one who decided to model those studies using a binomial setting. That’s the important thing.

Richard,

It sounds like the only case in which the binomial assumption would be met is in preregistered studies, cases which are sure not to be biased.

Perhaps I can make the point clearer with explicit probability theory. Let’s assume, for the sake of argument, that frequentist probability makes sense to apply in this case (I’m not sure what the “long-run sequence” of events here is, but we can pretend…). Let S and N be random variables that represent the number of significant and non-significant results in a “group” of studies. You only look at cases where S + N > 3, so we need to condition on that fact. We are interested in the probability

P(S & N | S + N > 3) = P(N | S)P(S) / P(N + S > 3)

Now we ask, what is the probability of 5 significant results and 0 nonsignificant ones? This is:

P(N=0 | S = 5)P(S = 5) / P(N + S > 3)

We cannot evaluate this without making some assumptions. If we make the binomial assumption, then N+S is fixed at 5, meaning P(N + S > 3) = 1. Also, N and S are completely redundant, meaning that P(N = 0 | S = 5) = 1. So,

P(S = 5 & N = 0 | S + N > 3) = 1 * P(S = 5) / 1 = P(S = 5)

and we can easily compute this via the binomial probability mass function. If we assume a power of .4, then this becomes .4 ^ 5 ≈ .01. There are no more extreme results than 5/5, so our p value is about .01.

But here we had to make the binomial assumption. Other rules lead to other probabilities. Suppose that we stop when we have a single non-significant result, or we reach 5 total studies. What is the probability of 5 significant results and 0 nonsignificant ones?

It is easy to enumerate the possibilities (with probabilities in parentheses)

N (.6)

SN (.24)

SSN (.096)

SSSN (.0384)

SSSSN (0.01536)

SSSSS (0.01024)

P(S + N > 3) = .0384 + 0.01536 + 0.01024 = .064

P(N | S)P(S) = 0.01024

P(S = 5 & N = 0 | S + N > 3) = 0.01024 / .064 = .16

Although in cases like this (where N and S are both random) it is a little trickier to define what “more extreme” results means, I think it is arguable that S=5 is the most extreme result, so our p value is .16.
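The enumeration above can be reproduced with a short Python sketch. It assumes the same toy stopping rule (stop at the first non-significant result or after 5 studies) and the same assumed power of .4; the function and variable names are hypothetical, chosen only for this illustration.

```python
def stopping_rule_outcomes(power, max_studies):
    """Enumerate every outcome sequence for the rule: stop at the first
    non-significant result (N) or once `max_studies` studies have been run."""
    outcomes = {}
    for s in range(max_studies):
        # s significant results followed by one non-significant result
        outcomes["S" * s + "N"] = power**s * (1 - power)
    # every study significant: stopped only because the cap was reached
    outcomes["S" * max_studies] = power**max_studies
    return outcomes


outcomes = stopping_rule_outcomes(power=0.4, max_studies=5)

# Condition on the selection step: only sets with more than 3 studies are examined.
p_long = sum(p for seq, p in outcomes.items() if len(seq) > 3)
conditional = outcomes["SSSSS"] / p_long

print(round(p_long, 3), round(conditional, 2))  # 0.064 0.16
```

Changing `power` or `max_studies` shows how strongly the conditional probability depends on the assumed stopping rule.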

Now, you don’t have to agree with the simple stopping rule model above to see the point. The point is that the model matters. The p value changes by more than an order of magnitude (here, from .01 to .16) depending on which model you pick. In my comment, I even warned about the danger of conditioning on long sets of results, but I see you still did this in the PLoS paper.

You need to be able to argue that you can specify each term in the general formula

P(N | S)P(S) / P(N + S > 3)

Your binomial model makes strong assumptions that enable you (implicitly) to do this, but at the expense of believability.

Suppose that researchers do follow the sequential rule: “stop when we have a single non-significant result, or we reach 5 total studies”. (I agree that the exact rule does not matter much, the principle is the same.)

So, researcher X reports 5 studies that all reject the null (SSSSS) and it is a situation where you and I happen to know the true power of each experiment is 0.4. On the basis of these successful experimental results, researcher X claims support for his/her theory. (Obviously a lot more is involved in relating data to theory, but let us suppose that the crux of the issue is the successful experimental results.)

Other researchers find the results and theory interesting, so they decide to repeat the experiments, using the same sequential rule. As you note, the outcome probabilities are:

N (.6)

SN (.24)

SSN (.096)

SSSN (.0384)

SSSSN (0.01536)

SSSSS (0.01024)

More than half of these replicating researchers will stop after the first (non-significant) experiment, and they will conclude that they failed to replicate researcher X’s findings. Indeed, to get the full success probability we can just add up the probabilities for the failures and subtract from one. Naturally, that gives us the binomial probability of producing 5 successful experiments (0.01024).

The point is, even if all researchers use the sequential rule, the binomial still applies for calculating the probability of 5 successes. Your calculation is ignoring most of the failures generated by the sequential rule because it insists that we only consider those experiment sets that produce at least 4 experiments. But scientists following this sequential rule would not ignore those failures; they would argue that their findings with one, two, or three experiments were contrary to the level of success reported by researcher X. (Mind you, I am not saying that a single negative finding would invalidate researcher X’s theory; properly relating data to theory is another issue. It’s not clear that any of the above scientists are doing a good job of developing or testing the theory.)

When the full number of failures is taken into account, the binomial calculation of Ptes is the correct one relative to how scientists actually behave, even when they do not have a fixed number of experiments.
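The unconditional calculation described in this comment can be checked with a few lines of Python. This is only a sketch under the same assumptions as the example (known power of .4, stop at the first non-significant result or after 5 studies); the variable names are illustrative. It counts every investigation, including those that stop after one, two, or three experiments, rather than conditioning on sets with at least 4 experiments.

```python
power = 0.4
max_studies = 5

# Probability that an investigation ends with a non-significant result
# after s significant ones (s = 0, 1, ..., max_studies - 1).
failure_probs = [power**s * (1 - power) for s in range(max_studies)]

# Everything that is not a failure is a full run of significant results.
p_all_significant = 1 - sum(failure_probs)

# Fixed-N binomial probability of 5 significant results out of 5.
binomial_prob = power**max_studies

print(round(failure_probs[0], 2))  # 0.6, the share stopping after one study
print(round(p_all_significant, 5), round(binomial_prob, 5))  # 0.01024 0.01024
```

The two numbers agree because no short sequences are discarded here; the disagreement in this thread is about whether that unconditional view or the conditional one is the relevant reference set.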

Now you’re talking about something (only examine cases where N=0) that a) isn’t your procedure (it just has the same probability in this one extreme case), and b) wasn’t taken into account at all when you computed Type I and Type II error rates (because you assume it is impossible for N + S to be anything other than what it is in the studies you looked at). You conditioned on N + S in every simulation you did.

But if you choose to include all the possibilities (not just N + S = n + s), you need to include the rejections and failures to reject for all S,N under the null, and marginalize over all possibilities. But unlike with the binomial model which conditions on N+S, you *now* need a weighting function over the possible N+S values. The negative binomial model helpfully provides one, but we know the negative binomial model is wrong. So, you’re stuck not knowing anything about the Type I and Type II error rates.

Your procedure assumes that N+S can only be what it was observed to be. If this is false, the marginalization over data possibilities required to compute all the frequentist probabilities you computed is wrong.

If your previous example was a poor one, then please propose another. I’ve been following the spirit of the TES analysis. The only change is to adapt it to your proposed sequential rule. I’m not following your second paragraph, can you give an example?

Also, under some other sampling procedure (eg, negative binomial), there is the possibility of more extreme results than S = 5, N = 0, so the p value would be different than the one in the “stop after 5” case. The “stop after 5” case is a bit misleading since in both that case and the binomial case the most extreme possible data sets are the same, but that won’t generally be true; after having seen a S = 5, N = 0 case, it could have been possible to observe S = 6, N=0, etc, etc.

[…] There is a lot more to be said about this state of affairs but I won’t go into this because others have already summarized many of the arguments about this much better than I […]

There are two questions here. 1. Are there problems with the TES? Yes, and I have discussed some in Schimmack (2012). 2. Do Greg’s results show that there are major problems with questionable research practices in psychology? The answer is yes. The reason is that p < .1 may be meaningless in a set of 1000 studies (some bias is there, we all know that). But how much bias does it take to get p < .1 even in a small set of 4 or 5 studies? The answer is a lot.

In this case, p-curve is actually helpful. If you put Greg's set of ALL psych studies with 4+ studies into p-curve.com, you learn that the set of studies has NO EVIDENTIAL VALUE. Nothing. Zip. Nada. The null hypothesis cannot be rejected. What a waste.

Ulrich Schimmack

Greg, I think it would be easier to understand my second paragraph if you consider the following question: What are the Type I and Type II error rates of your procedure (under a suitable alternative), under sequential sampling? (just simulate it, and you’ll see what I mean by having to marginalize over N+S in a way that you didn’t under the binomial model)

Richard and I took the conversation off-line for a while to try to hash out our differences. Unfortunately, this did not really help much; except for a few emails to clarify Richard’s concerns, we did not debate the issues. I thought readers of this blog might want to see where the discussion ended so I have enclosed below my one substantive email (Dec. 22) to address Richard’s concerns. Richard has not yet responded, perhaps because this is a busy time of the year.

——————

I wanted to make sure I fully understood your comments, and (coupled with this being a busy time of the year) it has taken a bit of time to prepare a response. I also wanted to try to write up a response that fully addressed your concerns, although I am ready to continue the discussion if necessary. I am cc’ing one of my PLOS One co-authors because I had asked him to consider your most recent email.

I think we do not disagree about the calculation of Ptes as an estimate of the probability of success for future experiments. If the (known) true power of an experiment is 0.4, then the probability of, say, 4 significant outcomes from 4 experiments is 0.4^4=0.0256 (and this calculation is valid whether one runs four experiments in parallel or in a sequential approach that stops as soon as a non-significant experiment is produced).

Your concern is about the Type I error rate generated by the TES’s requirement of only analysing studies with 4 or more experiments. Your argument is that for some sequentially-constructed experiment sets, the TES has a Type I error rate of 100%. I think you are wrong because the experiment sets with fewer than 4 experiments are relevant to the Type I error rate.

Let us take the QANSR example in your JMP comment, where the true power is known (by us) to be 0.4. In this example, a scientist will sequentially run experiments until he gets a non-significant result. He then stops running experiments and publishes everything. Such a scientist happens to get four significant results and one non-significant result. He publishes all of them and makes some theoretical claim on the basis of the data (e.g., the “effect” exists with such and such magnitude). I apply the TES and get

Ptes = 5* (0.4^4)*(0.6) + 0.4^5 = 0.0768 + 0.01024 = 0.08704

which would be interpreted as indicating bias. You say this conclusion is unwarranted because the low Ptes value would occur even though the scientist published all findings. But in this scenario the above outcome is very rare. As my response to your JMP comment pointed out, in my simulations only 1538 of 100,000 experiment sets following this sequential process produced an experiment set like this. Your counterargument is that 1538/100000 cannot be considered as the Type I error rate because we only test the 1538 (or so) experiment sets. Out of _those_ sets the Type I error rate is 100%.
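Both numbers in this part of the exchange can be approximated with a small Python sketch. It is not the simulation code from the JMP reply; it is a hypothetical re-creation that assumes a known power of .4, a rule of running experiments until the first non-significant result, and an arbitrary random seed. The names `p_tes` and `run_until_nonsignificant` are made up for this illustration.

```python
import random
from math import comb


def p_tes(successes, n, power):
    """Binomial probability of at least `successes` significant results
    out of `n` experiments with the given (assumed known) power."""
    return sum(
        comb(n, k) * power**k * (1 - power) ** (n - k)
        for k in range(successes, n + 1)
    )


def run_until_nonsignificant(power, rng):
    """Simulate one investigation: keep running experiments until one is
    non-significant; return the number of significant results before it."""
    significant = 0
    while rng.random() < power:
        significant += 1
    return significant


# Ptes for 4 significant results out of 5 experiments.
print(round(p_tes(4, 5, 0.4), 5))  # 0.08704

# How often does the run-until-failure process yield exactly
# four significant results followed by one non-significant result?
rng = random.Random(2014)  # arbitrary seed, an assumption of this sketch
n_sets = 100_000
four_then_fail = sum(
    run_until_nonsignificant(0.4, rng) == 4 for _ in range(n_sets)
)
# Expected to be near 0.4**4 * 0.6 = 0.01536; the exact value varies with the seed.
print(four_then_fail / n_sets)
```

The simulated frequency is the unconditional rate across all investigations; restricting the denominator to the sets that are actually tested gives the conditional rate that the two sides disagree about.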

I think to work our way out of the confusion we need to consider some of the first principles. You give it as a default that we are in a situation where there is no publication bias because all experimental results are reported. Let us suppose that the scientist is the first person to investigate this phenomenon and just happens to get 4 out of 5 significant results. What are the odds of such an outcome? It is (0.4^4)*0.6 = 0.01536. Since the TES will conclude bias for such an experiment set, this is the Type I error rate for this pattern of experimental outcomes. Since the TES is willing to consider some other outcomes (e.g., 5 out of 5 successes and other orders of significant and non-significant outcomes), the TES process thinks the Type I error rate is a bit higher.

Given the rarity of such an outcome, we can easily imagine that in most such scenarios, the scientist is not the only person to have investigated this phenomenon. Most investigations following the above sequential approach will stop with fewer than five experiments. (Indeed, 60% of the investigations will stop when the first experiment produces a non-significant outcome.) As you noted, the TES will not be applied to most of those investigations, but they are certainly relevant for the scientist trying to interpret his 4 out of 5 successful experiments. Bias can only be understood relative to the theoretical claims being proposed by a scientist. Thus, if the scientist only ran the five experiments and reported the 4 significant and 1 non-significant outcomes, he still needs to consider other investigations when making his theoretical claims. Just because he found 4 out of 5 significant effects does not mean that he should ignore the thousands of other investigations that reported non-significant results (remember your example assumes that scientists publish their non-significant outcomes).

A scientist who reports 4 out of 5 significant experimental outcomes and interprets them in the context of other investigations is not going to be “caught” by a TES analysis because the theoretical conclusions are not going to be drawn just from the new 4 out of 5 significant experimental outcomes. The simulations in my JMP reply demonstrate that pooling across all QANSR-type experiment sets leads to an accurate estimate of the effect size.

On the other hand, a scientist who ignores other published investigations is cherry-picking results to improperly support a theoretical claim. The TES will properly characterize the inconsistency between the reported data and the theoretical claims. The problem is not that the TES ignores other studies with fewer than 4 experiments (although it does), but that the scientist ignores them. That’s where the bias is introduced that makes the relationship between the reported studies and the theoretical claims unbelievable. Even if the scientist did not personally run additional experiments, they almost surely exist. If journals/authors would not (or could not) publish those experiments, then we have publication bias that drives the inconsistency between the reported experimental results and the theoretical conclusion. In such a case, the TES draws the correct conclusion.

Dr. R just posted the R-Index for the 18 studies in Francis et al. (2014). Take a look at how it compares to TES. http://replicationindex.wordpress.com/2014/12/13/the-r-index-for-18-multiple-study-articles-in-science-francis-et-al-2014/