Moving beyond P values: Everyday data analysis with estimation plots

Joses Ho, Tayfun Tumkaya, Sameer Aryal, Hyungwon Choi, Adam Claridge-Chang

Preprint posted on July 26, 2018

Article now published in Nature Methods

A visual, intuitive and widely accessible tool could finally help us move from asking “does it?” to “how much?”

Selected by Gautam Dey


Statistical analysis in the biological sciences has long been dominated by null-hypothesis significance testing (NHST). Statisticians and quantitatively minded biologists alike have been crying themselves hoarse about the fallacies and intrinsic limitations of this approach for, believe it or not, approximately 75 years [1,2]. Unfortunately, there has been little consensus on the practical steps needed to achieve significant reform.

The authors illustrate the key limitations of NHST, as well as their proposed solution, using an experimental setup we are all too familiar with: one containing two groups of data points, representing a control and a test/intervention sample. Such an experiment would traditionally be visualized using bar graphs (Fig 1A), box plots (Fig 1B), or perhaps scatter plots (Fig 1C), and analyzed with a Student’s t-test or a related NHST variant.

Figure 1: Reproduced from Figure 1 of Ho et al. 2018 under a CC-BY-NC-ND 4.0 international license. Two-group data represented by bar plots (A), box plots (B), and scatter plots with jitter (C). (D) Histogram-like scatter plots with jitter, with the null-hypothesis distribution and p-value (red segment). (E) Estimation plot with the difference-of-means distribution and 95% CI (red line).


What is wrong with the status quo?  

  • NHST focuses purely on a binary decision [3] to accept or reject the null hypothesis (that the means of the two groups are identical), diverting attention away from the actual effect size; this problem is emphasized by bar plots and only moderately mitigated by box and scatter plots.
  • Visualizing the null distribution and the p-value threshold (Fig 1D, red tail) helps drive home the problems with NHST. First, even an infinitesimally small intervention to any real system will produce at least some effect, making the zero-effect hypothesis intrinsically flawed [4]. Second, since the p-value threshold (usually 0.05) actually lies within the tail of the null distribution, we end up concluding that control and test samples are different by demonstrating that they are sometimes the same!
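
To make the red-tail picture concrete, here is a minimal sketch (not from the preprint; the function name and data are illustrative) of a permutation test in plain Python: shuffling the group labels generates the null distribution of Fig 1D, and the p-value is simply the fraction of that distribution at least as extreme as the observed difference.

```python
import random
import statistics

def perm_test_p(control, test, n_perm=5000, seed=0):
    """Two-sided permutation p-value for the difference of means.

    Shuffling group labels generates differences that approximate the
    null distribution (Fig 1D); the p-value is the fraction of those
    differences at least as extreme as the observed one (the red tail).
    """
    rng = random.Random(seed)
    observed = statistics.mean(test) - statistics.mean(control)
    pooled = list(control) + list(test)
    n_c = len(control)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[n_c:]) - statistics.mean(pooled[:n_c])
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_perm

# Illustrative data: a clear shift between control and test
control = [4.2, 5.1, 4.8, 5.5, 4.9, 5.0]
test = [5.9, 6.4, 5.7, 6.8, 6.1, 6.0]
p = perm_test_p(control, test)
```

Note that even here the output is a single tail probability, which says nothing about how large the shift actually is.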


How to fix it?

  • Estimation plots focus on the difference of means (Fig 1E). This visual representation draws attention to the effect size, which is what we (should) actually care about. The 95% confidence interval [5] (red bar in Fig 1E), which by definition encompasses the bulk of the ∆ sampling-error distribution, is more intuitively grasped and far better behaved than the p-value. Here, we conclude that control and test samples are different by demonstrating that they are almost always different.
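
The bootstrap logic behind such a plot can be sketched with only the Python standard library (the function name and defaults below are my own, not DABEST's): resample each group with replacement, recompute the difference of means many times, and take the 2.5th and 97.5th percentiles of those differences as the 95% CI.

```python
import random
import statistics

def bootstrap_mean_diff(control, test, n_boot=5000, seed=0):
    """Observed difference of means plus a 95% bootstrap percentile CI."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        c = [rng.choice(control) for _ in control]  # resample with replacement
        t = [rng.choice(test) for _ in test]
        diffs.append(statistics.mean(t) - statistics.mean(c))
    diffs.sort()
    ci = (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)])
    return statistics.mean(test) - statistics.mean(control), ci

observed, (lo, hi) = bootstrap_mean_diff([1.0, 2.0, 3.0, 4.0, 5.0],
                                         [3.0, 4.0, 5.0, 6.0, 7.0])
```

The headline numbers are the effect size and its interval, not a tail probability, which is exactly the shift in emphasis the authors argue for.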


Why I chose this preprint  

I loved this preprint! The estimation plot provides a complete yet visually accessible description of the data, and working through the steps in Figure 1 has given me a visual framework for interpreting what I thought I understood about hypothesis testing. More importantly, the authors go to great lengths to make estimation plotting broadly accessible, providing five different ways to create these plots, ranging from Python code to a handy web tool that requires no programming experience whatsoever. Go ahead, try it out!



  1. Berkson, J. Tests of Significance Considered as Evidence. J. Am. Stat. Assoc. 37, 325–335 (1942).
  2. Halsey, L. G., Curran-Everett, D., Vowler, S. L. & Drummond, G. B. The fickle P value generates irreproducible results. Nat. Methods 12, 179–185 (2015).
  3. McShane, B. B. & Gal, D. Statistical Significance and the Dichotomization of Evidence. J. Am. Stat. Assoc. 112, 885–895 (2017).
  4. Cohen, J. The earth is round (p < .05). Am. Psychol. 49, 997–1003 (1994).
  5. Cumming, G. Understanding The New Statistics. (Routledge, 2011). doi:10.4324/9780203807002

Tags: quantitative biology, significance testing, statistics for biology

Posted on: 1st August 2018, updated on: 3rd August 2018


A brief interview with the authors

    Joses Ho and Adam Claridge-Chang shared their responses to the questions below.

    Could you tell us a little bit about how the project started? For example, was the tool a side effect of your ongoing work on estimation statistics, motivated by the needs of other research projects in the group, or a directed effort to address a general shortcoming in the field?

    It started back when Adam and I overlapped at Oxford’s human genetics centre. It is a hub of activity around genome-wide association studies (GWAS), and uses a host of sophisticated statistical tools. As part of my PhD on language genetics, I became familiar with GWAS p-values and the odds ratio, a number GWAS uses to express relative disease risk. So that experience was my first contact with effect sizes.

    Around the same time, Adam, who does experimental neurogenetics, was frustrated by the p-value rollercoaster that so many experience: one day a phenotype is significant, the next day it isn’t. He had also heard about effect sizes at Oxford, and when he moved to Singapore he took the time to read some textbooks on the topic, including Statistics with Confidence by Douglas Altman and others, and Geoff Cumming’s Understanding the New Statistics. The concepts and tools in those books are pretty eye-opening.

    So when I graduated and returned to Singapore to start in Adam’s lab as the resident data scientist, he handed me a pile of these textbooks to read so I could retrain in estimation statistics. Since then, we have used meta-analysis (which is widely used in clinical settings) to synthesise thirty years of research on short-term memory in flies, and to systematically review over 300 preclinical studies of rodent anxiety. Our paper on fly anxiety-like behaviours used meta-analytic data to compare our results to rodent studies, and also used estimation statistics to analyse and present our results.

    Adam also loans new lab members his well-worn copy of Edward Tufte’s The Visual Display of Quantitative Information, and our group makes an effort to apply Tuftian principles when working on figures for manuscripts. In early 2016, Adam remarked to me that the confidence intervals for the effect sizes could be depicted the way Gardner and Altman did in their textbook (see Figures 1 and 2 in this PDF), and that we could use bootstrap methods to obtain the full effect-size distribution (the ‘∆ curve’).

    The benefits of using the bootstrap were immediately obvious: we did not have to make assumptions about the underlying population (which Gardner, Altman, and Cumming do), and I could depict the confidence interval as a graded distribution, indicating a likelihood of values for the effect size rather than just a point estimate with hard error-bar boundaries.
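
One way to sketch that graded depiction (a hypothetical helper, not DABEST code) is as nested percentile intervals computed from the sorted bootstrap ∆ distribution; drawing the bands with increasing transparency yields the graded shading.

```python
def graded_intervals(sorted_diffs, levels=(0.5, 0.8, 0.95)):
    """Nested percentile intervals from a sorted bootstrap ∆ distribution.

    Values near the point estimate fall inside every band; values near
    the 95% limits fall only inside the widest one, so shading each band
    with increasing transparency conveys a graded likelihood.
    """
    n = len(sorted_diffs)
    bands = {}
    for level in levels:
        alpha = (1.0 - level) / 2.0
        lower = sorted_diffs[int(alpha * (n - 1))]
        upper = sorted_diffs[int((1.0 - alpha) * (n - 1))]
        bands[level] = (lower, upper)
    return bands

# A toy sorted ∆ distribution standing in for real bootstrap output
deltas = sorted(i / 100.0 for i in range(101))
bands = graded_intervals(deltas)
```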

    I started writing a version in Python for internal lab use, and along the way we gave it the name Data Analysis with Bootstrap-coupled ESTimation (DABEST). The first version of DABEST and the webapp were released in late 2017.

    So it really grew out of our own frustration with significance testing, and a desire for better tools for ourselves. Then, once we were happy with it internally, it made sense to share it with everyone else.

    As you discuss in your paper, statisticians and biologists alike have been working on alternatives to NHST statistics for years without any sort of consensus in the community. Do you think the easy accessibility and visual nature of your tool could help shift the balance a bit? Your preprint has already triggered significant discussion on social media platforms. Do you think this could be leveraged into lasting impact?

    Student’s t-test has incredible brand recognition among scientists, so a key motivation for the creation of the webapp was indeed an attempt to improve the branding of estimation methods. We’re not exactly marketing experts, but we hope that improving awareness and accessibility will encourage some to make the switch. Adam has given several talks in the past where he has tried to get scientists to use estimation methods as an alternative to NHST. In doing this he realised he needed a simple handle people could easily grasp and remember, so he decided on ‘estimation statistics’. I also attempted to get other laboratories to use my Python code, but the need to learn programming was a major barrier to adoption, so it became clear I needed to be able to say: “There’s an app for that.”

    While we were targeting basic biomedical researchers, one surprise is that our tool has drawn a fair amount of interest from other areas: ecologists, sports scientists, psychologists and others. We do hope that estimation plots have the potential to change the data-analysis culture. Still, p-values have been under fire for over 75 years and they are still going strong, so maybe we’ll be doomed to use them forever?

    Anything else you’d like to tell us about the paper, or what’s next for you and your research group?

    We’ve submitted the paper and hope to see it in print, and we are encouraged and pleased by the reception the preprint has received. v0.1.4 of DABEST, which features aesthetic tweaks, will also be released very shortly.
