
`@IgorCarron` writes about interesting stuff going on at the vanguard of applied maths research every week.

This week: robust PCA & boy bands, prediction in dynamic graph sequences, streaming algorithms, sparse low-rank matrices, Johnson-Lindenstrauss transform, Walsh transform, implicit ranking of products by a clickstream, high codimension, graph similarity.


As nice as it is to be able to assume normality, … there are problems. The most obvious problem is that we could be wrong.

One … very nice thing … is that, in many situations, … [being wrong] won’t send us immediately to jail without passing “Go.” Under a … broad set of conditions … our assumption [could be wrong, yet we] get away with it. By this I mean that our answer may still be correct even if our assumption is false. This is what we mean when we speak of a [statistic] … being robust.

However, this still leaves at least two problems. In the first place, it is not hard to create reasonable data that violate a normality (or homogeneity of variance) assumption and have “true” answers that are quite different from the answer we would get by making a normality assumption. In other words, we can’t always get away with violating assumptions. Second, there are many situations where even with normality, we don’t know enough about the statistic we are using to draw the appropriate inferences.

One way to look at bootstrap procedures is as procedures for handling data when we are not willing to make assumptions about the parameters of the populations from which we sampled. The most that we are willing to assume (and it is an absolutely critical assumption) is that the data we have are a reasonable representation of the population from which they came. We then resample from the pool of data that we have, and draw inferences about the corresponding population and its parameters.

The second way to look at bootstrap procedures is to think of them as what we use when we don’t know enough.

David Howell

(Source: uvm.edu)
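
The resampling idea in the quote above can be sketched in a few lines. This is only a minimal percentile bootstrap, not Howell's own procedure: the function name `bootstrap_ci`, the sample values, the seed and the choice of 2,000 resamples are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample WITH replacement from the data we have, recompute the
    # statistic each time, and sort the resulting estimates.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Made-up sample, assumed to be a reasonable representation of its population.
sample = [12.1, 9.8, 11.4, 10.3, 13.0, 9.5, 10.9, 11.7, 10.1, 12.4]
low, high = bootstrap_ci(sample)
print(f"approx. 95% interval for the median: {low} to {high}")
```

Nothing here assumes normality; the only assumption doing any work is the one the quote calls critical, that the sample looks like the population.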

Upon my return [to academia, after years of private statistical consulting], I started reading the Annals of Statistics … and was bemused. Every article started with:

Assume that the data are generated by the following model…

followed by mathematics exploring inference, hypothesis testing, and asymptotics…. I [have a] very low … opinion … of the theory published in the Annals of Statistics. [S]tatistics [is] a science that deals with data.

The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions.

In the mid-1980s … A new research community … sprang up. Their goal was predictive accuracy….. They began working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.

The advances in methodology and increases in predictive accuracy since the mid-1980s that have occurred in the research of machine learning has been phenomenal…. What has been learned? The three lessons that seem most important:

• Rashomon: the multiplicity of good models;
• Occam: the conflict between simplicity and accuracy;
• Bellman: dimensionality, blessing or curse

Leo Breiman, Statistical Modeling: The Two Cultures (2001)

(which are: machine learning / artificial intelligence / algorithmists —vs— model builders / statistics / econometrics / psychometrics)

## Use Robust Statistics

I’ll go into more technical detail about robust data analysis elsewhere. Here I want to put forward the simplest argument for it. (This is repeated probably verbatim from Karen Kafadar.)

Say you have 5 independent estimates of something important, and you’re going to make an important decision based on what the true story is. These numbers are in thousands of dollars, because it’s important.

• \$77,010 k
• \$76,778 k
• \$79,8344 k
• \$78,652 k
• \$78,136 k

Oops, there is a typo, but you don’t notice it. Having read The Wisdom of Crowds and learned the Central Limit Theorem, you naturally average these estimates together to cancel out possible biases or inaccuracies. This is part of a much larger project with many more numbers (which is why you didn’t notice the typo), and you aren’t applying common sense to the numbers, just plugging them into your analytic tools.

• Result: \$221,800 k.

Due to your unnoticed typo, the analysis is badly wrong, the decision that follows from it is badly wrong, and everybody loses.

Let’s say you had used the median instead of the mean. Nobody tells you in Stat 101 that the median is much more robust, nor do they talk about trimming, letter-value plots, tri-means, five-number summary, etc.

• Result: \$78,140 k

Yes, regardless of the typo the result is broadly correct. Correct data would have shown a mean ± SD of \$78,080 k ± \$1,250 k, so the median is in bounds.
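
To check the arithmetic yourself, here is a small sketch using only Python's standard library. The `trimmed_mean` helper is my own illustrative definition (drop the single smallest and largest value), not a standard-library function.

```python
import statistics

# The five estimates, in thousands of dollars, with the typo left in.
estimates = [77_010, 76_778, 798_344, 78_652, 78_136]

def trimmed_mean(values, k=1):
    """Drop the k smallest and k largest values, then average the rest."""
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])

mean_with_typo = statistics.mean(estimates)      # 221784: dragged far off by one value
median_with_typo = statistics.median(estimates)  # 78136: barely notices the typo
trimmed_with_typo = trimmed_mean(estimates)      # about 77932.67: also close to the truth

print(mean_with_typo, median_with_typo, round(trimmed_with_typo, 2))
```

The trimmed mean is one of the simple robust alternatives mentioned above: it throws away the extremes before averaging, so a single wild value cannot dominate.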

Of course, I just made this data up — but pick a distribution and generate 100 random numbers with it, then inject an extra digit into one or more of them and see the results on the mean and on the median. You can analyze the differences with calculus but I think the intuition is obvious enough that I can just leave it there.
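
A minimal sketch of that experiment, assuming a normal distribution centred on 78,000 with a spread of 1,000 (both numbers arbitrary), and using "multiply by ten" as a crude stand-in for typing an extra digit:

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# 100 made-up draws; the distribution and its parameters are assumptions.
clean = [rng.gauss(78_000, 1_000) for _ in range(100)]

# Corrupt exactly one value, roughly what a stray extra digit would do.
corrupted = clean.copy()
corrupted[0] *= 10

mean_shift = abs(statistics.mean(corrupted) - statistics.mean(clean))
median_shift = abs(statistics.median(corrupted) - statistics.median(clean))

print(f"mean moved by   {mean_shift:,.0f}")
print(f"median moved by {median_shift:,.0f}")
```

With 100 points, one corrupted value moves the mean by 9% of that value, while the median can move by at most the gap between two neighbouring middle values.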

It’s unbelievable that I didn’t learn these methods until graduate school. Undergraduate journalism majors are taught betas, p-values, null hypothesis versus alternative hypothesis, and theoretical “samples”, “populations” and “experiments”. But they don’t do, like, simple, common sense data analysis. Just poking around without heavy math tools to ask natural questions.

This is the best quant finance book I’ve yet read.  The symbols on the cover may look daunting, but the text actually keeps notation simple.  Many topics are covered quickly and accessibly; this is a maths book you can actually skim, or skip around in.  I think that’s due to good writing.

Also:  I stand firmly in the Robust camp.  After my class with Karen Kafadar, I’m confident that Robust models are easier to explain and more reliable.  Her typical example was to mis-type just one of the data by repeating a digit or moving the decimal place — and how likely is that! — and see how much the output changed.  Ideally your real-world recommendation shouldn’t change too much based on just one data point.  (If that’s unavoidable, you should withdraw any recommendation.)

So many mathematical questions or ideas yield up a flowering of possible tweaks and adjustments that can be made to a model, with no recommendation of which parameter value to use.  A good answer is:  whatever is most stable across different potential scenarios.

There is a wide variance among the Frank J. Fabozzi series (Advanced Stochastic Optimization, for example, is way worse than this).  If you only have time to read one, read this one.


## Anscombe’s quartet

The four data sets are different, yet they have the same “line of best fit” as computed by ordinary least squares regression.
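
The claim is easy to verify by hand. The sketch below uses the values of the quartet as they are commonly reproduced from Anscombe's 1973 paper, and a hand-rolled `ols` helper (my own name) for simple least squares, so it runs on the standard library alone.

```python
import statistics

def ols(x, y):
    """Slope and intercept of the simple least-squares line y = a + b*x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I, II and III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: ols(x, y) for name, (x, y) in quartet.items()}
for name, (slope, intercept) in fits.items():
    print(f"{name:>3}: y = {intercept:.2f} + {slope:.3f} x")
```

All four fits come out at roughly y = 3 + 0.5x, even though set III owes its line to a single outlier and set IV to a single high-leverage point. That is the robustness argument again: plot the data before trusting the fit.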
