
Posts tagged with econometrics

It takes ~20 observations to verify your first significant digit of the mean with confidence.

Do you know how many observations it takes to verify your first sig-fig of the variance? More like 1000. And that’s just to get one digit of accuracy! Higher moments (skew, kurtosis) are even worse.

That’s why I often laugh out loud when I read in the newspaper claims that rely on a certain value of the variance. Even in serious, published papers!—I often see tables with estimates of standard deviation that go out to three decimal places, just because the software spat the numbers out that way. It gives a false sense of accuracy. It’s ridiculous.
Karen Kafadar
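A rough way to see the gap between the two estimates is to simulate it. The sketch below is only an illustration under assumed values (normal data with a hypothetical mean of 10 and standard deviation of 2; `rel.spread` is a made-up helper name), not Kafadar's own calculation. For normal data the relative standard error of the sample variance is sqrt(2/(n−1)), so it shrinks far more slowly in practical terms than that of the mean.

```r
# Relative spread (sd of the estimate / true value) of the sample mean
# and the sample variance, estimated by simulation.
# mu and sigma are assumed values chosen for illustration only.
set.seed(1)
rel.spread <- function(n, reps = 5000, mu = 10, sigma = 2) {
  draws <- replicate(reps, {
    x <- rnorm(n, mu, sigma)
    c(mean(x), var(x))            # row 1: sample means, row 2: sample variances
  })
  c(mean     = sd(draws[1, ]) / mu,
    variance = sd(draws[2, ]) / sigma^2)
}

# Compare 20 observations with 1000 observations
rbind(n20 = rel.spread(20), n1000 = rel.spread(1000))
```

How many observations "one significant digit" takes depends on the ratio of the mean to the standard deviation, so the exact counts will differ from data set to data set.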




I feel vindicated by the Netflix Engineering team’s recent blog post explaining what they did with the results of the Netflix Prize. What they wrote confirms what I’ve been saying about recommendations, as well as my experience designing recommendation engines for clients, in several ways:

  1. Fancy ML techniques don’t matter so much. The winning team, BellKor’s Pragmatic Chaos, implemented ensemble methods with something like 112 techniques smushed together. You know how many of those the Netflix team implemented? Exactly two: RBM’s and SVD.

    If you’re a would-be internet entrepreneur and your idea relies on some ML but you can’t afford a quant to do the stuff for you, this is good news. Forget learning every cranny of research like Pseudo-Markovian Multibagged Quantile Dark Latent Forests! You can watch an hour-long video on OCW by Gilbert Strang which explains SVD and two hour-long Google Tech Talks by Geoff Hinton on RBM’s. RBM’s are basically a special kind of neural network, with a theoretical argument for why they work better. SVD is a dimension reduction technique from linear algebra. (There are many Science / Nature papers on dimension reduction in biology; if you don’t have a licence there are paper-request fora on Reddit.)

    Not that I don’t love reading about awesome techniques, or that something other than SVD isn’t sometimes appropriate. (In fact using the right technique on the right portion of the problem is valuable.) What Netflix people are telling us is that, in terms of a Kaggleistic one-shot on the monolithic data set, the diminishing marginal improvements to accuracy from a mega-ensemble algo don’t count as useful knowledge.


  2. Domain knowledge trumps statistical sophistication. This has always been the case in the recommendation engines I’ve done for clients. We spend most of our time trying to understand the space of your customers’ preferences — the cells, the topology, the metric, common-sense bounds, and so on. You can OO program these characteristics. And (see bottom) doing so seems to improve the ML result a lot.

    Another reason you’re probably safe ignoring the bleeding edge of ML research is that most papers develop general techniques, test them on famous data sets, and don’t make use of domain-specific knowledge. You want a specific technique that’s going to work with your customers, not a no-free-lunch-but-optimal-according-to-X academic algorithm. Some Googlers did a sentiment-analysis paper on exactly this topic: all of the text analysis papers they had looked at chose not to optimise on specific characteristics (like keywords or text patterns) known to anyone familiar with restaurant-review data. They were able to achieve a superior solution to that particular problem without fancy new maths, only using common sense and exploration specific to their chosen domain (restaurant reviews).



  3. What you measure matters more than what you squeeze out of the data. The reason I don’t like* Kaggle is that it’s all about squeezing more juice out of existing data. What Netflix has come to understand is that it’s more important to phrase the question differently. The one-to-five-star paradigm is not going to accurately assess their customers’ attitudes toward movies. The similarity space is more like Dr Hinton’s reference to a ten-dimensional library where neighbourhood relationships don’t just go along a Dewey Decimal line but also style, mood, season, director, actors, cinematography, and yes the “People like you” metric (“collaborative filtering”, a spangled bit of jargon).

    For them the preferences evolve fairly quickly over time, which has to make it hard. If your users’ preferences also evolve over time: good luck.

    John Wilder Tukey: “To statisticians, hubris should mean the kind of pride that fosters an inflated idea of one’s powers and thereby keeps one from being more than marginally helpful to others. … The feeling of “Give me (or more likely even, give my assistant) the data, and I will tell you what the real answer is!” is one we must all fight against again and again, and yet again.” via John D Cook 

Relatedly, a friend of mine who’s doing a Ph.D. in complexity (modularity in Bayesian networks) has been reading the Kaggle fora from time to time. His observation of the Kaggle winners is that they usually win with gross assumptions about either the generating process or the underlying domain. Basically they limit the ML search using common sense and data exploration; that gives them a significant boost in performance (1−AUC).

* I admire @antgoldbloom for following through on his idea and I do think they have a positive impact on the world. Which is much better than the typical “Someone should make X, that would be a great business” or even worse but still typical: “I’ve been saying they should have that!” Still, I do hold to my one point of critique: there’s no back-and-forth in Kaggle’s optimisation.
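To make the SVD point from item 1 concrete, here is a minimal sketch of dimension reduction on a tiny ratings matrix. The ratings are invented for illustration, every user is assumed to have rated every movie (real recommendation data is mostly missing), and nothing here is claimed to be what Netflix actually runs.

```r
# Hypothetical 5 users x 4 movies ratings matrix (made-up numbers).
ratings <- matrix(c(5, 4, 1, 1,
                    4, 5, 1, 2,
                    1, 1, 5, 4,
                    2, 1, 4, 5,
                    5, 5, 2, 1),
                  nrow = 5, byrow = TRUE)

s <- svd(ratings)                 # ratings = U diag(d) V'
k <- 2                            # keep the two strongest "taste" dimensions

# Rank-k approximation built from the top k singular triplets
ratings.k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

round(ratings.k, 1)               # smoothed ratings
s$d                               # singular values: how much each dimension matters
```

The approximation error (in the Frobenius norm) is exactly the size of the singular values that were dropped, which is why a few dimensions can be enough when the rest are small.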




It is never in good taste to express the sum of two quantities as

  • 1 + 1 = 2.   (1)

[Everyone] is aware that

  • 1 = ln e

and further that

  • 1 = sin²q + cos²q.

In addition, it is obvious to the casual reader that

  • 2 = ∑_{n=0}^{∞} 1/2ⁿ.

Therefore equation (1) can be rewritten more scientifically as:

  • ln e + (sin²q + cos²q) = ∑_{n=0}^{∞} 1/2ⁿ.

by John Siegfried in the Journal of Political Economy. Hat tip: @unlearningecon

(Source: twitter.com)




As nice as it is to be able to assume normality, … there are problems. The most obvious problem is that we could be wrong.


One … very nice thing … is that, in many situations, … [being wrong] won’t send us immediately to jail without passing “Go.” Under a … broad set of conditions … our assumption [could be wrong, yet we] get away with it. By this I mean that our answer may still be correct even if our assumption is false. This is what we mean when we speak of a [statistic] … being robust.



However, this still leaves at least two problems. In the first place, it is not hard to create reasonable data that violate a normality (or homogeneity of variance) assumption and have “true” answers that are quite different from the answer we would get by making a normality assumption. In other words, we can’t always get away with violating assumptions. Second, there are many situations where even with normality, we don’t know enough about the statistic we are using to draw the appropriate inferences.



One way to look at bootstrap procedures is as procedures for handling data when we are not willing to make assumptions about the parameters of the populations from which we sampled. The most that we are willing to assume (and it is an absolutely critical assumption) is that the data we have are a reasonable representation of the population from which they came. We then resample from the pool of data that we have, and draw inferences about the corresponding population and its parameters.

The second way to look at bootstrap procedures is to think of them as what we use when we don’t know enough.

David Howell

(Source: uvm.edu)
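The resampling step Howell describes can be sketched in a few lines. This is an illustrative example only: the data are simulated from an exponential distribution as a stand-in for a real sample, and the percentile interval is just one of several bootstrap intervals one could use.

```r
set.seed(1)
x <- rexp(50)        # hypothetical skewed sample standing in for real data

# Resample the observed data with replacement, recomputing the median each time
boot.medians <- replicate(2000, median(sample(x, replace = TRUE)))

# Percentile interval for the population median
ci <- quantile(boot.medians, c(0.025, 0.975))
ci
```

No normality assumption enters anywhere; the only assumption is the critical one named above, that the sample is a reasonable representation of its population.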




[I]n the late 1920’s and early 1930’s…. There were lots of deep thoughts [in economics], but a lack of quantitative results. … It is usually not of very great practical or even scientific interest to know whether the [causal] influence [of some factor] is positive or negative, if one does not know anything about the strength.


But much worse is the situation when an [outcome] is determined by many different factors at the same time, some factors working in one direction, others in the opposite directions. One could write long papers about so-called tendencies explaining how this … might work…. But what is the … total net effect of all the factors? This question cannot be answered without measures of … strength….

Trygve Haavelmo

Bank of Sweden pseudo-Dynamite Prize Laureate 1989, for work in econometrics

(Source: nobelprize.org)




Upon my return [to academia, after years of private statistical consulting], I started reading the Annals of Statistics … and was bemused. Every article started with:


Assume that the data are generated by the following model…


followed by mathematics exploring inference, hypothesis testing, and asymptotics…. I [have a] very low … opinion … of the theory published in the Annals of Statistics. [S]tatistics [is] a science that deals with data.

The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions.

In the mid-1980s … A new research community … sprang up. Their goal was predictive accuracy…. They began working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.

The advances in methodology and increases in predictive accuracy since the mid-1980s that have occurred in the research of machine learning has been phenomenal…. What has been learned? The three lessons that seem most important:

  • Rashomon: the multiplicity of good models;
  • Occam: the conflict between simplicity and accuracy;
  • Bellman: dimensionality — blessing or curse

Leo Breiman, Statistical Modeling: The Two Cultures (2001)

(which are: machine learning / artificial intelligence / algorithmists —vs— model builders / statistics / econometrics / psychometrics)




What happens if, instead of doing a linear regression with sums of monomial terms, you do the complete opposite? Instead of regressing the phenomenon against a sum of separate one-variable terms, what if you regressed it against an explanation built from products of the variables?

I first thought of this question several years ago whilst living with my sister. She’s a complex person. If I asked her how her day went, and wanted to predict her answer with an equation, I definitely couldn’t use linearly separable terms. That would mean that, if one aspect of her day went well and the other aspect went poorly, the two would even out. Not the case for her. One or two things could totally swing her day all-the-way-to-good or all-the-way-to-bad.

The pattern of her moods and emotional affect has nothing to do with irrationality or moodiness. She’s just an intricate person with a complex utility function.

If you don’t know my sister, you can pick up the point from this well-known stereotype about the difference between men and women:


“Men are simple, women are complex.” Think about a stereotypical teenage girl describing what made her upset. “It’s not any one thing, it’s everything.”

I.e., nonseparable interaction terms.

I wonder if there’s a mapping that sensibly inverts strongly-interdependent polynomials with monomials — interchanging interdependent equations with separable ones. If so, that could invert our notions of a parsimonious model.

Who says that a model that’s short to write in one particular space or parameterisation is the best one? or the simplest? Some things are better understood when you consider everything at once.
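Here is a small sketch of the contrast, with simulated data where the outcome depends only on the product of three variables. The variable names and the data-generating process are assumptions made up for the illustration.

```r
set.seed(1)
n  <- 500
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
x3 <- runif(n, -1, 1)

# Hypothetical "complex" outcome: nothing matters on its own, only all together
y <- x1 * x2 * x3 + rnorm(n, sd = 0.05)

separable <- lm(y ~ x1 + x2 + x3)   # sum of one-variable terms
entangled <- lm(y ~ x1:x2:x3)       # a single three-way interaction term

c(separable = summary(separable)$r.squared,
  entangled = summary(entangled)$r.squared)
```

The entangled model has fewer coefficients than the separable one, yet it is the only one of the two that explains this outcome, which is the sense in which the shortest model depends on the parameterisation.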




It is a common mistake of inexperienced statisticians to plunge into a complex analysis without paying attention to what the objectives are or to even whether the data are appropriate to the proposed analysis. Look before you leap!

Julian James Faraway, Linear Models with R




Briefly: the linear regression model. We suppose we can explain or predict y using a vector of variables x. As in Gauß’ estimation theory, y is supposed to be unobservable, and thus has to be estimated. The assumption that y depends on x is expressed this way: the posterior distribution Prob{ Y | X } is different from the prior distribution Prob{ Y }.

The minimization of variance of the difference between [our estimation of Y given X] and [Y] leads to a unique solution: the conditional expectation.

The linear hypothesis says that the estimated value should be an affine expression of X. Moreover, the affine parameters which minimise the variance of the error are given by:



The above linear model coincides with the optimal conditional expectation model when X,Y are Gaussian.
Michel Grabisch, in Modeling Data by the Choquet Integral
(liberally edited)
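For reference, the standard least-squares solution for the affine predictor can be written as follows. The notation is an assumption chosen here, not necessarily Grabisch's own.

```latex
\hat{Y} = a + b^{\top} X, \qquad
b = \operatorname{Var}(X)^{-1}\,\operatorname{Cov}(X, Y), \qquad
a = \mathbb{E}[Y] - b^{\top}\,\mathbb{E}[X]
```

Equivalently, $\hat{Y} = \mathbb{E}[Y] + \operatorname{Cov}(Y, X)\,\operatorname{Var}(X)^{-1}\,(X - \mathbb{E}[X])$, which is exactly $\mathbb{E}[Y \mid X]$ when $X, Y$ are jointly Gaussian.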




Check them out.

Here are thirty homoskedastic ones:

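> homo.wiener <- array(0, c(100, 30))    # 100 time steps (rows) x 30 walks (columns)
> for (j in 1:30) {
    for (i in 2:nrow(homo.wiener)) {     # nrow(), not length(): length() is 100 * 30 = 3000
      homo.wiener[i, j] <- homo.wiener[i - 1, j] + rnorm(1)
    }
  }

> for (j in 1:30) {
    plot( homo.wiener[, j],
          type = "l", col = rgb(.1, .1, .1, .6),
          ylab = "", xlab = "", ylim = c(-25, 25) )
    par(new = TRUE)                      # draw the next walk on the same axes
  }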

 

Here’s just the meat of that wiener, in case the for loops or window dressing were confusing.

homo.wiener[i] <-  homo.wiener[ i - 1] + rnorm(1)

 

I also made you some heteroskedastic wieners.

> same for-loop encasing. ∀ j make wieners; ∀j plot wieners
> hetero.wiener[i] <- hetero.wiener[ i-1 ] + rnorm(1, sd=rpois(1,1) )




 

It wasn’t even that hard — here are some autoregressive(1) wieners as well.

> same for-loop encasing. ∀ j make wieners; ∀ j plot wieners
> ar.wiener[i] <- ar.wiener[i-1]*.9 + rnorm(1)
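The three recipes above differ only in how the next step is drawn and in how much of the previous value is kept, so they can be sketched with one helper. The function name `walk` and its arguments are made up for this illustration.

```r
# One walk of length n: w[i] = phi * w[i-1] + step()
# phi = 1 gives a plain random walk; phi < 1 gives an autoregressive(1) series.
walk <- function(n, step, phi = 1) {
  w <- numeric(n)                       # starts at 0
  for (i in 2:n) w[i] <- phi * w[i - 1] + step()
  w
}

set.seed(1)
homo   <- walk(100, function() rnorm(1))
hetero <- walk(100, function() rnorm(1, sd = rpois(1, 1)))   # sd redrawn every step
ar     <- walk(100, function() rnorm(1), phi = .9)

# All three on one set of axes
matplot(cbind(homo, hetero, ar), type = "l", lty = 1,
        col = c("black", "red", "blue"), xlab = "", ylab = "")
```

Swapping in a different `step` function (say `function() rcauchy(1)`) is enough to try most of the variations listed further down.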

 

Other types of wieners:

  • a.wiener[i-1] + rnorm(1) * a.wiener[i-1] + rnorm(1)
  • central.limit.wiener[i-1] + sum( runif(17, min=-1) )
  • cauchy.wiener[i-1] + rcauchy(1)      #leaping lizards!

     
  • random.eruption.wiener[i-1] + rnorm(1) * random.eruption.wiener[i-1] + rnorm(1)



     
  • non.markov.wiener[i-1] + non.markov.wiener[i-2] + rnorm(1)
  • the.wiener.that.never.forgets[i] <- sum( the.wiener.that.never.forgets[1:(i-1)] ) + rnorm(1)
  • non.wiener[i] <- rnorm(1)
     
  • moving.average.3.wiener[i] <- .6 * rnorm(n=1,sd=1) + .1 * rnorm(n=1,sd=50) + .3 * rnorm(n=1, mean=-3,sd=17)
  • wiener.2d <- array(0, c(2, 100));    # an R name can't start with a digit
    if ( runif(1) > .5 ) {               # coin flip: which coordinate moves this step
         wiener.2d[1,i] <- wiener.2d[1,i-1] + rnorm(1);  wiener.2d[2,i] <- wiener.2d[2,i-1]
    } else {
         wiener.2d[2,i] <- wiener.2d[2,i-1] + rnorm(1);  wiener.2d[1,i] <- wiener.2d[1,i-1]
    }


     
  • wiener.131d <- array(0, c( 131, 100 )); ....
  • cross.pollinated.wiener
  • contrasting sd=1,2,3 of homo.wieners
     
 

What really stands out in writing about these wieners after playing around with them, is that logically interesting wieners don’t always make for visually interesting wieners.

There are lots of games you can play with these wieners. Some of my favourites are:

  • trying to make the wieners look like stock prices (I thought sqrt(rcauchy(1)) errors with a little autocorrelation looked pretty good)
  • trying to make them look like heart monitors

Also it’s pretty hard to tell which wieners are interesting just from looking at the code above. I guess you will just have to go mess around with some wieners yourself. Some of them will surprise you and not do anything; that’s instructive as well.

 

VOICE OF GOD: WHAT’S UP. I AM THAT I AM. I DECLARE THAT THE WORD ‘WIENER’ IS OBJECTIVELY FUNNY. THAT’S ALL FOR NOW. SEE YOU WEDNESDAY THE 17TH.




My interpretation of [Leland Wilkinson’s] grammar [of statistical graphics]:

Data is the most important thing, and the thing that you bring to the table.

—Geometric objects … what you actually see on the plot: points, lines, polygons, etc.

Statistics transform the data in many useful ways. For example, binning and counting to create a histogram….

—Scales map values in the data space to values in an aesthetic space, whether it be colour, or size, or shape. Scales also provide an inverse mapping: a legend.

—A coordinate system describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to make it possible to read the graph.

— A facetting, or conditioning, speci

Hadley Wickham

(Source: had.co.nz)




Complex systems are ones with a large effective number of strongly-interdependent variables.

This excludes both low-dimensional systems, and high-dimensional ones where the variables are either independent, or so strongly coupled that only a few variables effectively determine all the rest.
Cosma Rohilla Shalizi

(Source: stat.cmu.edu)




what are we to make of those statistically-minded disciplines currently obsessed by the search for ever more complicated models, to be summarized in terms of endless tables of coefficients and other statistical paraphernalia, and often with scarcely a graph in sight, whether of original data, model results or model diagnostics?
Leland Wilkinson, The Grammar of Graphics

(Source: jstatsoft.org)




 
Null hypothesis testing is voodoo.

Changes in the mental state of the experimenter should not affect the objective inference of the experiment. An argument for using Bayesian data analysis instead of H0 vs Ha.

Imagine you have a scintillating hypothesis about the effect of some different treatments on a metric dependent variable. You collect some data (carefully insulated from your hopes about differences between groups) and compute a t statistic for two of the groups. The computer program, that tells you the value of t, also tells you the value of p, which is the probability of getting that t by chance from the null hypothesis.
You want the p value to be less than 5%, so that you can reject the null hypothesis and declare that your observed effect is significant.
What is wrong with that procedure? Notice the seemingly innocuous step from t to p. The p value, on which your entire claim to significance rests, is conjured by the computer program with an assumption about your intentions when you ran the experiment. The computer assumes you intended, in advance, to fix the sample sizes in the groups.
In a little more detail, and this is important to understand, the computer figures out the probability that your t value could have occurred from the null hypothesis if the intended experiment was replicated many, many times. The null hypothesis sets the two underlying populations as normal populations with identical means and variances. If your data happen to have six scores per group, then, in every simulated replication of the experiment, the computer randomly samples exactly six data values from each underlying population, and computes the t value for that random sample. Usually t is nearly zero, because the sample comes from a null hypothesis population in which there is zero difference between groups. By chance, however, sometimes the sample t value will be fairly far above or below zero. The computer does a bizillion simulated replications of the experiment. The top panel of Figure 1 shows a histogram of the bizillion t values. According to the decision policy of NHST, we decide that the null hypothesis is rejectable by an actually observed t_obs value if the probability that the null hypothesis generates a value as extreme or more is very small, say p < 0.05. The arrow in Figure 1 marks the critical value t_crit at which the probability of getting a t value more extreme is 5%. We reject the null hypothesis if t_obs > t_crit. In this case, when N = 6 is fixed for both groups, t_crit = 2.23. This is the critical value shown in standard textbook t tables, for a two-tailed t-test with 10 degrees of freedom.
In computing p, the computer assumes that you did not intend to collect data for some time period and then stop; you did not intend to collect more or less data based on an analysis of the early results; you did not intend to have any lost data replaced by additional collection. Moreover, you did not intend to run any other conditions ever again, or compare your data with any other conditions. If you had any of these other intentions, or if the analyst believes you had any of these other intentions, the p value can change dramatically.
 
AUTHOR: John Kruschke. The Road to Null Hypothesis Testing is Paved with Good Intentions.
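The dependence on intentions can be checked by simulation. The sketch below is an illustration under assumptions, not Kruschke's own code: both groups are drawn from the same standard normal population, and the alternative "intention" is a hypothetical one where the experimenter tests at 6 per group and, if that is not significant, collects 6 more per group and tests again.

```r
set.seed(1)
n <- 6
t.crit <- qt(0.975, df = 2 * n - 2)   # two-tailed 5% critical value, 10 df
round(t.crit, 2)                      # 2.23, the textbook value quoted above

one.run <- function() {
  a <- rnorm(2 * n)                   # null hypothesis: no difference between groups
  b <- rnorm(2 * n)
  p.first <- t.test(a[1:n], b[1:n], var.equal = TRUE)$p.value   # stop at N = 6
  p.all   <- t.test(a, b, var.equal = TRUE)$p.value             # after collecting 6 more
  c(fixed = p.first < 0.05,                   # intended to stop at N = 6
    peek  = p.first < 0.05 || p.all < 0.05)   # intended to look, then maybe continue
}

# Share of null experiments declared "significant" under each intention
rates <- rowMeans(replicate(5000, one.run()))
rates
```

With the sample size fixed in advance the false-alarm rate sits near the nominal 5%; with the peek-then-continue plan the same nominal threshold rejects more often, so an honest p value for that plan would have to be computed differently.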





A “truly” random, uniform random, completely random sequence might look like

◯◯⨯◯⨯⨯⨯⨯◯◯⨯◯◯⨯⨯◯⨯◯◯⨯⨯◯⨯⨯◯⨯◯◯⨯◯
R code: > xooooo = sample( c("◯", "⨯") , 30, replace = TRUE)

like the flips of a fair coin. But there are other “random”s as well.

Biased

For example, biased random, like an unfair coin with 4/5 bias, might generate a sequence that looks like this:

◯◯◯◯⨯◯◯⨯◯◯⨯◯◯◯◯◯◯◯◯◯◯◯◯◯◯◯◯◯◯◯

R code: > xooooo = sample( c("◯","◯","◯","◯", "⨯") , 30, replace = TRUE)

 

Self-Correlated

But there’s also autocorrelated, or serially correlated, randomness.

◯◯◯◯◯◯◯◯◯◯◯◯◯◯◯◯◯◯◯◯⨯⨯⨯◯◯◯◯◯◯◯

For example you feel fine ◯ 80% of the time and 20% you’re sick ⨯ — and of course the sick days are more likely to come one after another. Or 80% of the time you don’t smoke ◯ but then you buy a pack and all of a sudden you smoke ⨯⨯⨯ three days in a row. Once you’ve broken your resolve, you’re more likely to smoke again the next day.
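One simple way to generate a sequence like this is a two-state Markov chain, where today's chance of being sick depends on yesterday. This is a sketch under assumed numbers: the function name and the two probabilities are made up, chosen so that about 20% of days are sick in the long run (0.1 / (0.1 + 0.4) = 0.2).

```r
# Each day is sick (TRUE) or fine (FALSE); sickness is more likely after a sick day.
simulate.days <- function(n, p.fall.sick = 0.1, p.stay.sick = 0.6) {
  sick <- logical(n)                          # day 1 starts out fine
  for (t in 2:n) {
    p.sick  <- if (sick[t - 1]) p.stay.sick else p.fall.sick
    sick[t] <- runif(1) < p.sick
  }
  sick
}

set.seed(42)
days <- simulate.days(30)
paste(ifelse(days, "⨯", "◯"), collapse = "")
```

Setting `p.stay.sick` equal to `p.fall.sick` removes the dependence on yesterday and gives back the biased coin from the previous section.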

 

Equation-wise, autocorrelation amounts to adding a self-lag term to the other explanatory variables (plus unexplained residual). Besides habit and viral invasion, autocorrelation brings many things under the penumbra of randomness:

  • income. The strong gets more, while the weak ones fade. If you made a lot of money at your previous job, your next employer will pay you more either to steal you away or simply because salary history determines compensation in HR’s formula.
  • unemployment. Jobless today, jobless tomorrow. Those who are unemployed for more than six months are even more likely to be unemployed for the long term. Also people who take care of their own kids as their job are likely to still be doing so next week and next year rather than working for a company.
  • likelihood of cancer. Back to the subject of smoking, your likelihood of getting cancer accumulates faster and faster the more you smoke. I’ve seen claims that there is a kink in the cumulative propensity to cancer rate above one pack / day.
  • stock prices. Stocks don’t just jump around in a Cauchy distribution, although maybe the daily change in stock price does. Daily change is a lag term (today’s price minus yesterday’s price), so that’s serial correlation.

Serial correlation or autocorrelation refers to things that bunch together. When it rains, it pours.