Saturday, February 28, 2009

A statistics book not to miss

Chris Blattman recently suggested a new book by Stephen Ziliak and Deirdre McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. The title was sufficiently inflammatory, so I decided to give it a read. It is officially my favorite book I’ve read in the last year. What follows is a simple review, with a big take-away: doing good science means more than just blindly using 5% or 10% levels of statistical significance testing.

In full disclosure, I must admit to a slight bias in liking the book. A few years ago a professor at UC Irvine dismissed a paper of mine on the economic impact of forced migration as being of little interest, since the scientific question should really be the sign of the effect (that is, whether the effect is different from zero, and in what direction), and clearly we can all guess that forced migration is bad. I immediately disagreed that sign is all that matters, though I had no developed argument to stand on. I ignored his opinion and went ahead with the project. I am glad to now find in Ziliak and McCloskey not only allies in believing that the size of an effect matters, but also the argument that looking only at sign is in fact not scientific. To quote Ziliak and McCloskey: “real science depends on size, on magnitude”.

The main problem the book addresses is the tendency of researchers to look only at the statistical significance of their coefficients and then leave the question at that. Statistical significance is not the same as real-world significance; it is instead a measure of the precision of the estimate. The authors give a wonderful example, summarized by Times Higher Education:

Suppose we have two diet pills, and you want to lose 5lb. One pill promises you that you'll lose 20lb, give or take 14, while the other promises that you'll lose 5lb, give or take 0.5lb. While it's true that the first pill's outcomes are more erratic, they are also more effective. You would take the second pill only if you wanted your weight loss to deviate as little as possible from exactly 5lb.

Perhaps some people think about weight loss this way. If so, they happily conflate statistical and policy significance. However, that is probably not how most researchers, or their clients, think about the matter. They are seeking effectiveness - what the authors call "oomph" - but they are settling instead for mere precision under the guise of "statistical significance".

One pill, then, gives a big, though variable, effect, while the other offers precision with a much smaller effect. Since our original desire was to lose weight, we should actually prefer the pill with the larger effect, but the now-standard non-Bayesian significance test, the “p-value”, in fact has us prefer the more precise pill. The size of the effect matters, not just (or in this case, not at all) its precision.
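
To see the problem in miniature, here is a small Python sketch of the two pills. I'm assuming the promised loss is the estimated effect and the "give or take" is its standard error, which is my reading of the summary rather than the authors' exact setup:

```python
# A minimal sketch of the diet-pill comparison, assuming the promised loss is
# the estimated effect and the "give or take" is its standard error (my reading
# of the summary, not the authors' exact setup).
from scipy.stats import norm

pills = {
    "Oomph (lose 20lb, give or take 14)": (20.0, 14.0),
    "Precision (lose 5lb, give or take 0.5)": (5.0, 0.5),
}

for name, (effect, se) in pills.items():
    z = effect / se              # test statistic against the null of "no effect"
    p = 2 * norm.sf(abs(z))      # two-sided p-value
    print(f"{name}: z = {z:.2f}, p-value = {p:.2g}")

# The Oomph pill comes out "not significant" at the 5% level (z ~ 1.4, p ~ 0.15),
# while the Precision pill is overwhelmingly "significant" (z = 10), even though
# Oomph takes off four times as much weight. A significance-only rule picks the
# smaller effect.
```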

While the p-value does not actually give us the probability that our estimate is correct, it is commonly used as if it does. But even under this incorrect interpretation, we are wrong to focus on it alone. If an astronomer says there is a 60% chance an asteroid is going to hit the planet and wipe out the human race, do we ignore him because he is not 90% or 95% certain? Definitely not. Why? Because the size of the effect (wiping out the human race) is too big to ignore. The same goes for global warming and for the effect of Vioxx on heart attacks (also well discussed by Ziliak and McCloskey).

So the question is: what should we do instead of relying on p-values? For those of us not interested in converting to the Bayesian methodology (I personally see little reason to add the complexity: if the errors have a normal distribution, the OLS estimate is the same as the maximum likelihood estimate, which is the same as a Bayesian estimate with a completely diffuse prior, which is a reasonable prior for most issues), there is an answer which, while simple-sounding, is not normally followed by researchers in practice: don't rely on a single blind test. Specifically, citing comments by Kenneth Rothman on page 165, researchers should instead focus on “measurement and interpretation of size of effects, confidence intervals, and examination of power functions with respect to effect size”.
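
As a concrete illustration of that advice, here is a short Python sketch of what reporting an effect size with a confidence interval and a power function might look like. The coefficient, standard error, and the normal-with-known-standard-error setup are my own illustrative assumptions, not taken from the book:

```python
# A rough sketch of Rothman's advice for a single estimated coefficient,
# assuming a normally distributed estimator with known standard error.
# The numbers are hypothetical, purely for illustration.
import numpy as np
from scipy.stats import norm

estimate, se = 1.8, 1.0        # hypothetical coefficient and standard error
alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)

# 1. Report the size of the effect together with a confidence interval,
#    not just a p-value.
low, high = estimate - z_crit * se, estimate + z_crit * se
print(f"estimate = {estimate:.2f}, 95% CI = ({low:.2f}, {high:.2f})")

# 2. Examine power as a function of effect size: the chance of rejecting
#    "no effect" if the true effect really were that large.
for true_effect in np.arange(0.5, 3.01, 0.5):
    power = norm.sf(z_crit - true_effect / se) + norm.cdf(-z_crit - true_effect / se)
    print(f"true effect = {true_effect:.1f}: power = {power:.2f}")
```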

So, don’t ignore the asteroid just because its p-value is 12% rather than 10% or 5%, and use a range of plausible values to better understand how big the effect is likely to be and whether an effect of that size is truly of interest to us. In general, researchers need to be more careful about ensuring results are interpretable, and not just empty sign tests that give no clue as to whether we should care about the effect.

The problem of power mentioned in the quote is more technical, but it is basically a problem of excessive skepticism. When sample sizes are small, standard errors are large and noisily estimated, so a real effect may be dismissed as not statistically significant only because we lack the statistical power to detect it. This was a problem, for instance, in the trials of Vioxx's effect on heart attacks, which had a small number of participants and a rare outcome (though a very important one to those taking the drug). Rejecting the effect on those grounds may itself be a serious mistake.
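
To see how easily this happens, here is a quick Python simulation. The per-arm sample size and heart-attack rates below are hypothetical stand-ins rather than the actual Vioxx trial figures:

```python
# A quick simulation of the low-power problem with a rare outcome and a small
# trial. The sample size and event rates are hypothetical stand-ins, not the
# actual Vioxx figures.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500                          # participants per arm (hypothetical)
p_control, p_drug = 0.01, 0.02   # a genuine doubling of a rare event rate
alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)

sims, rejections = 10_000, 0
for _ in range(sims):
    x_drug = rng.binomial(n, p_drug)
    x_control = rng.binomial(n, p_control)
    p_pool = (x_drug + x_control) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    if se > 0:
        z = (x_drug / n - x_control / n) / se   # two-proportion z-test
        rejections += abs(z) > z_crit

print(f"power to detect the doubled risk: {rejections / sims:.2f}")
# With these settings the test flags the genuinely doubled risk only a fraction
# of the time (roughly a quarter), so "not significant" here mostly reflects
# the small sample, not a harmless drug.
```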

Overall, I think this is an important book. The ideas in it definitely aren't new; they've just never been so well argued.

A final note: despite the general quality of the technical discussion, the authors end the book with an almost personal attack against Ronald Fisher, who used the work of William “Student” Gosset to develop the now standard idolatry of the p-value. I would have enjoyed the book even more if I had skipped chapters 20-22. Ziliak and McCloskey must surely be glad the dead can’t sue for libel.
