Posts tagged with robust statistics

As nice as it is to be able to assume normality, … there are problems. The most obvious problem is that we could be wrong.

One … very nice thing … is that, in many situations, … [being wrong] won’t send us immediately to jail without passing “Go.” Under a … broad set of conditions … our assumption [could be wrong, yet we] get away with it. By this I mean that our answer may still be correct even if our assumption is false. This is what we mean when we speak of a [statistic] … being robust.

However, this still leaves at least two problems. In the first place, it is not hard to create reasonable data that violate a normality (or homogeneity of variance) assumption and have “true” answers that are quite different from the answer we would get by making a normality assumption. In other words, we can’t always get away with violating assumptions. Second, there are many situations where even with normality, we don’t know enough about the statistic we are using to draw the appropriate inferences.

One way to look at bootstrap procedures is as procedures for handling data when we are not willing to make assumptions about the parameters of the populations from which we sampled. The most that we are willing to assume (and it is an absolutely critical assumption) is that the data we have are a reasonable representation of the population from which they came. We then resample from the pool of data that we have, and draw inferences about the corresponding population and its parameters.

The second way to look at bootstrap procedures is to think of them as what we use when we don’t know enough.

David Howell

(Source: uvm.edu)
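
The resampling idea in the quote above can be sketched in a few lines of Python. This is a generic percentile bootstrap for the median, not Howell's own procedure. The helper name `bootstrap_interval`, the synthetic skewed sample, the 5,000 resamples, and the 95% level are all assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_interval(data, stat=statistics.median, n_resamples=5000,
                       alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the observed data with replacement, same size as the original,
    # and recompute the statistic on each resample.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Read the interval off the empirical quantiles of the resampled statistic.
    lo_idx = round((alpha / 2) * (n_resamples - 1))
    hi_idx = round((1 - alpha / 2) * (n_resamples - 1))
    return estimates[lo_idx], estimates[hi_idx]


# Synthetic, deliberately skewed sample (made up for this example).
data_rng = random.Random(42)
sample = [data_rng.expovariate(1 / 50) for _ in range(31)]

lo, hi = bootstrap_interval(sample)
print(f"sample median: {statistics.median(sample):.1f}")
print(f"95% bootstrap interval for the median: ({lo:.1f}, {hi:.1f})")
```

Nothing here assumes a normal population. The one thing the sketch does lean on is the assumption the quote calls critical: that the sample is a reasonable stand-in for the population it came from.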

I’ll go into more technical detail about robust data analysis elsewhere. Here I want to put forward the simplest argument for it. (This is repeated probably verbatim from Karen Kafadar.)

Say you have 5 independent estimates of something important, and you’re going to make an important decision based on what the true story is. These numbers are in thousands of dollars, because it’s important.

  • $77,010 k
  • $76,778 k
  • $79,8344 k
  • $78,652 k
  • $78,136 k

Oops, there is a typo, but you don’t notice it. Having read The Wisdom of Crowds and the Central Limit Theorem, you naturally average these estimates together to cancel out possible biases or inaccuracies. This is part of a much larger project with many more numbers (which is why you didn’t notice the typo), and you aren’t using common sense on the numbers, just plugging them into your analytic tools.

  • Result: $221,800 k.
    Due to your unnoticed typo, the analysis is majorly wrong, the decision that follows from it is majorly wrong, and everybody loses.

Let’s say you had used the median instead of the mean. Nobody tells you in Stat 101 that the median is much more robust, nor do they talk about trimming, letter-value plots, trimeans, five-number summaries, etc.

  • Result: $78,140 k

Yes, regardless of the typo, the result is broadly correct. Correct data would have shown mean ± SD of roughly $78,080 k ± $1,250 k, so the median is well within bounds.
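
The arithmetic behind those two results is easy to check with Python's standard library. A minimal sketch: the list `with_typo` reads the mistyped entry as 798,344, and the `corrected` list assumes the intended third value was 79,834 (the post doesn't say which digit was the stray one, so that value is a guess).

```python
import statistics

# The five estimates from the list above, in thousands of dollars.
# The third entry carries the stray extra digit (79,8344 read as 798344).
with_typo = [77010, 76778, 798344, 78652, 78136]

# Assumption: the intended third value was 79,834.
corrected = [77010, 76778, 79834, 78652, 78136]

mean_typo = statistics.mean(with_typo)      # 221784, dragged up by one bad value
median_typo = statistics.median(with_typo)  # 78136, the middle value is untouched

mean_ok = statistics.mean(corrected)
sd_ok = statistics.stdev(corrected)         # sample standard deviation (n - 1)

print(f"with typo:  mean = {mean_typo:,.0f} k, median = {median_typo:,.0f} k")
print(f"corrected:  mean = {mean_ok:,.0f} k, sample SD = {sd_ok:,.0f} k")
```

The typo moves the mean by a factor of almost three, while the median stays at one of the honest estimates.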

Of course, I just made this data up — but pick a distribution and generate 100 random numbers with it, then inject an extra digit into one or more of them and see the results on the mean and on the median. You can analyze the differences with calculus but I think the intuition is obvious enough that I can just leave it there.
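
That experiment might look something like the sketch below. The normal distribution centered on 78,000 with SD 1,000, the fixed seed, and the helper `inject_extra_digit` (which simulates a typo by repeating one digit) are all assumptions made for illustration. Any other distribution or typo model would do.

```python
import random
import statistics


def inject_extra_digit(value, rng):
    """Simulate a typo by repeating one digit of a positive integer (hypothetical helper)."""
    digits = str(int(value))
    pos = rng.randrange(len(digits))
    # Duplicate the digit at `pos`, e.g. 78136 -> 781136 for pos == 2.
    return int(digits[:pos] + digits[pos] + digits[pos:])


rng = random.Random(0)

# 100 made-up estimates from an assumed normal distribution.
clean = [round(rng.gauss(78000, 1000)) for _ in range(100)]

# Copy the data and corrupt exactly one randomly chosen entry.
dirty = list(clean)
i = rng.randrange(len(dirty))
dirty[i] = inject_extra_digit(dirty[i], rng)

mean_shift = statistics.mean(dirty) - statistics.mean(clean)
median_shift = statistics.median(dirty) - statistics.median(clean)

print(f"entry {i}: {clean[i]} -> {dirty[i]}")
print(f"shift in mean:   {mean_shift:,.1f}")
print(f"shift in median: {median_shift:,.1f}")
```

One corrupted value out of 100 shifts the mean by thousands, because the bad value is about ten times too large and the mean spreads that error over every observation. The median can move at most to a neighboring order statistic, so its shift is tiny by comparison.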

It’s unbelievable that I didn’t learn these methods until graduate school. Undergraduate journalism majors are taught betas, p-values, null hypothesis versus alternative hypothesis, and theoretical “samples,” “populations,” and “experiments.” But they don’t do, like, simple, common-sense data analysis: just poking around without heavy math tools to ask natural questions.
