Posts tagged with statistics

## Dummyisation

Statisticians are crystal clear on human variation. They know that not everyone is the same. When they speak about groups in general terms, they know that they are reducing N-dimensional reality to a 1-dimensional single parameter.

Nevertheless, statisticians permit, in their regression models, variables that take on only a handful of values, such as `{0,1}` for `male/female` or `{a,b,c,d}` for `married/never-married/divorced/widowed`.

No one doing this believes that all such people are the same. And anyone who’s done the least bit of data cleaning knows that there will be `NA`'s, wrongly coded cases, mistaken observations, ill-defined measures, and aberrances of other kinds. It can still be convenient to use binary or n-ary dummies to speak simply. Maybe the marriages of some people coded as `currently married` are on the rocks, and therefore they are more like `divorced`—or like a new category of people in the midst of watching their lives fall apart. Yes, we know. But what are you going to do—ask respondents to rate their marriage on a scale of one to ten? That would introduce false precision and model error, and might put respondents in such a strange mood that they answer other questions strangely. Better to just live with being wrong. Any statistician who uses the `cut` function in R knows that the variable didn’t become basketed←continuous in reality. But a `facet_wrap` plot is easier to interpret than a 3D wireframe or cloud-points plot.

To the precise mind, there’s a world of difference between saying

• "the mean height of men > the mean height of women", and saying
• "men are taller than women".

Of course one can interpret the second statement to be just a vaguer, simpler inflection of the first. But some people understand statements like the second to mean “each man is taller than each woman”. Or, perniciously, they take “Blacks have lower IQ than Whites” to mean “every Black is mentally inferior to every White.”
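To make the gap between the two readings concrete, here is a minimal sketch on simulated data. The means and spreads are hypothetical numbers picked only for illustration, not measurements, and the normal distribution is an assumption.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical parameters (cm), chosen only for illustration.
men = [random.gauss(176, 7) for _ in range(2000)]
women = [random.gauss(163, 7) for _ in range(2000)]

# Reading 1: a comparison of two group means.
mean_gap = mean(men) - mean(women)

# Reading 2, taken literally: every man is taller than every woman.
every_man_taller = min(men) > max(women)

# Share of randomly matched (man, woman) pairs in which the woman is taller.
pairs = list(zip(men, women))
share_woman_taller = sum(w > m for m, w in pairs) / len(pairs)

print(f"mean gap: {mean_gap:.1f} cm")
print(f"every man taller than every woman: {every_man_taller}")
print(f"share of pairs where the woman is taller: {share_woman_taller:.1%}")
```

Under these assumptions the first reading holds while the literal second reading fails, and a non-trivial share of pairs goes the "wrong" way.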

I want to live somewhere between pedantry and ignorance. We can give each other a break on the precision as long as the precise idea behind the words is mutually understood.


Dummyisation is different to stereotyping because:

• stereotypes deny variability in the group being discussed
• dummyisation acknowledges that it’s incorrect, before even starting
• stereotyping relies on familiar categories or groupings like skin colour
• dummyisation can be applied to any partitioning of a set, such as one based on height or even one drawn at random

It’s the world of difference between taking on a hypothetical for the purpose of reaching a valid conclusion, and bludgeoning someone who doesn’t accept your version of the facts.

So this is a word I want to coin (unless a better one already exists—does it?):

• dummyisation is assigning one value to a group or region
• for convenience of the present discussion,
• recognising fully that other groupings are possible
• and that, in reality, not everyone from the group is alike.
• Instead, we apply some ∞→1 function or operator on the truly variable, unknown, and variform distribution or manifold of reality, and talk about the results of that function.
• We do this knowing it’s technically wrong, as a (hopefully productive) way of mulling over the facts from different viewpoints.
• In other words, dummyisation is purposely doing something wrong for the sake of discussion.
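As a rough sketch of the definition above: pick any partition, then apply a many-to-one summary to each part. The function name `dummyise`, its parameters, and the records are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def dummyise(records, group_of, value_of, summary=mean):
    """Collapse each group to one number (the 'many -> 1' step)."""
    groups = defaultdict(list)
    for record in records:
        groups[group_of(record)].append(value_of(record))
    return {label: summary(values) for label, values in groups.items()}

# Made-up records, for illustration only.
people = [
    {"status": "married", "height_cm": 181, "income": 52},
    {"status": "married", "height_cm": 165, "income": 38},
    {"status": "divorced", "height_cm": 174, "income": 45},
    {"status": "widowed", "height_cm": 160, "income": 30},
    {"status": "divorced", "height_cm": 190, "income": 35},
]

# A familiar partition: marital status.
by_status = dummyise(people, lambda p: p["status"], lambda p: p["income"])

# An arbitrary partition of the same people: a height cutoff.
by_height = dummyise(
    people,
    lambda p: "tall" if p["height_cm"] >= 175 else "short",
    lambda p: p["income"],
)

print(by_status)
print(by_height)
```

The same five people give different one-number stories depending on the partition, which is the point: the grouping is a convenience of the discussion, not a fact about the people.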


There’s a paper in PNAS suggesting that lots of published scientific associations are likely to be false, and that Bayesian considerations imply a p-value threshold of 0.005 instead of 0.05 would be good. It’s had an impact outside the statistical world, e.g., with a post on … Ars Technica…

3. If … you think p-value thresholds should be a publishing criterion, you’ve got worse problems than reproducibility.

4. False negatives are errors, too. People already report “there was no association between X and Y” (or worse “there was no effect of X on Y”) in subgroups where the p-value is greater than 0.05. If you have the same data and decrease the false positives you have to increase the false negatives.

5. The problem isn’t the threshold so much as the really weak data in a lot of research, …. Larger sample sizes or better experimental designs would actually reduce the error rate; moving the threshold only swaps which kind of error you make.

7. And finally, why is it a disaster that a single study doesn’t always reach the correct answer? Why would any reasonable person expect it to? It’s not as if we have to ignore everything except the results of that one experiment in making any decisions.
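Points 4 and 5 can be illustrated with a small power calculation. This is a sketch under simplifying assumptions that are not in the quoted post: a one-sided z-test with known variance and a hypothetical standardized effect size of 0.2.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power(effect_size, n, alpha):
    """Power of a one-sided z-test: P(reject | the effect is real)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_crit - effect_size * n ** 0.5)

effect = 0.2  # hypothetical standardized effect size
for n in (100, 400):
    for alpha in (0.05, 0.005):
        miss_rate = 1 - power(effect, n, alpha)
        print(f"n={n:4d}  alpha={alpha:<6}  false-negative rate={miss_rate:.2f}")
```

With `n` held fixed, tightening `alpha` raises the false-negative rate. Only the larger `n` lowers both kinds of error together.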

HT @zentree

Over a year ago, I wrote a letter to the editor of the Journal of Computational Science, urging the retraction of Bollen, Mao, and Zeng’s paper, “Twitter Mood Predicts the Stock Market.” Since JoCS is an Elsevier journal, one does not simply email the editor.

Rather, one has to register with the Elsevier author system, … submit `LaTeX` source code of a letter, along with supporting documents, author bio, .… I distilled the main arguments into two:

1. first, that the Granger causality tests presented in BMZ’s paper are … datamining, and present no evidence for a connection between Twitter and the Dow Jones Index;
2. and that the quoted predictive accuracy of the forecast model is so high, it would … [contradict] the experiences of … [traders] … and so this forecast accuracy is likely to be erroneously reported.
I included references to BMZ’s failed attempts to commercialize their patented techniques with Derwent.

Following the strictest protocol, the editor of JoCS duly sent this letter to reviewers. After roughly seven months, …

The reviewers’ comments were more than fair. If my arguments were unclear, I was more than happy to reword them and provide additional evidence to get my point across. So I edited my letter to the editor, and re-sent it. …

…within two months or so (the equivalent of overnight in journal-time), the editor sent me a rejection notice with … review, quoted below. This review—this review is sensational. As one afflicted with Hamlet Syndrome, I admire Reviewer #4’s conviction. As someone too often in search of the right phrase to dismiss a crap idea, I take delight in Reviewer #4’s acid pen: I have never seen a reviewer so viciously shit-can a paper before. Reviewer #4 tore my letter to pieces, then burned the pieces. Then poured lye on the ashes. Then salted the earth where the lye sizzled. Then burnt down the surrounding forest, etc.

Fun Coursera course on virology.

• Viruses are so numerous (10³⁰), filling up everywhere, that it gives this Boltzmann flavour of “enough stuff” to really do statistics on.

• Viruses are just a bundle of `{proteins, lipids, nucleic acids}` with a shell. It’s totally value-free, no social Darwinism or “survival of the fittest” being imbued with a moral colour. Just a thing that happened that can replicate.
• Maybe this is just because I was reading about nuclear spaces (⊂ topological vector space) and white-noise processes that I think of this. Viruses have a qualitatively different error structure than Gaussian. Instead of white-noise it’s about if they can get past certain barriers, like:
• survive out in the air/water/cyanide
• bind to a DNA
• adapt to the host’s defences
• … it seems like a mathematician or probabilist could use the viral world of errors to set out different assumptions for a mathematical object that would capture the broad features of this world that’s full of really tiny things but very different to gas particles.
• Did I mention that I love how viral evolution is totally value-neutral and logic-based?
• Did I mention how I love that these things are everywhere all the time, filling up the great microspace my knowledge had left empty between man > animals > plants > > bacteria > > minerals?

World record progression for the men’s Long Jump.

A jump process.

PUN FULLY INTENDED.

(Source: Wikipedia)


i.e. think about the number of rearrangements of `AAAAABBBBBBB` and then revalue `A as +` and `B as −` — or whatever.

Requires knowing that combinatorics can be thought of in terms of counting injections, bijections, etc. rather than real-life examples like cards or coin flips.
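A minimal sketch of that count, done two ways and checked by brute force on a smaller word. The smaller word `AABBB` is my own choice for the check.

```python
from itertools import permutations
from math import comb, factorial

word = "AAAAABBBBBBB"
n_a, n_b = word.count("A"), word.count("B")

# Choosing which positions hold the A's fixes the whole arrangement.
by_choosing_positions = comb(n_a + n_b, n_a)

# Same count as a quotient: permute all letters, then divide out the
# swaps among identical A's and among identical B's.
by_quotient = factorial(n_a + n_b) // (factorial(n_a) * factorial(n_b))

# Brute-force check on a smaller word, where enumeration is cheap.
small = "AABBB"
brute_force = len(set(permutations(small)))

# Revalue A as +1 and B as -1: each arrangement becomes a path of steps.
steps = [+1 if letter == "A" else -1 for letter in word]

print(by_choosing_positions, by_quotient)  # 792 792
print(brute_force, comb(5, 2))             # 10 10
print(sum(steps))                          # -2
```

Every rearrangement has the same step total, so the count is the number of distinct paths to that one endpoint.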


I argued that `CVaR` (expected shortfall) of personal income is a better indicator of a society’s success than is GDP.

$\mathtt{CVaR} \overset{\mathrm{def}}{=} \int_{\mathtt{lo}}^{\mathtt{hi}} \mathtt{value} \cdot \mathtt{probability}$

`CVaR` combines the basic statistical operations of

• subsetting and
• averaging.

In statistical analysis of the middle it’s useful to trim: set aside the upper and lower `X%` and look at those separately (winsorising is the cousin that clamps those extremes instead of dropping them). With `CVaR` it’s almost the opposite: look at the upper or lower edge only. (Although you could also look at only the bottom 50% which is not really an edge.)
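On a finite sample, "subset, then average" might look like the sketch below. The function name `lower_cvar` and the income figures are invented for illustration, and the tail is taken as a simple fraction of the sorted sample rather than as the integral above.

```python
from statistics import mean

def lower_cvar(values, fraction):
    """Mean of the lowest `fraction` of the values: subset, then average."""
    ordered = sorted(values)
    k = max(1, int(len(ordered) * fraction))  # truncates; keeps at least one
    return mean(ordered[:k])

# Made-up incomes (arbitrary units), with one very large value at the top.
incomes = [8, 11, 12, 15, 19, 22, 30, 41, 60, 250]

overall_mean = mean(incomes)           # moved a lot by the single 250
bottom_20 = lower_cvar(incomes, 0.2)   # ignores everything above the cut
bottom_50 = lower_cvar(incomes, 0.5)

print(overall_mean, bottom_20, bottom_50)
```

The overall mean here is pulled up by one observation, while the lower-tail averages only describe the people below the cut.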

You could also use the same technique to look at the “top” rather than the “bottom”. Think about, for example, the apparent puzzle of

• rising life expectancy, with
• stagnant longevity.

Average lifespans rise as early causes of death (dysentery, childbirth, violence) decline.

But death by “natural causes” (getting old and all your body systems start to fail | telomere cutoffs | whatever “natural causes” means; it’s sort of vague) doesn’t get postponed by as much.

I can think of three ways to even go about defining what arithmetic we’re going to perform on the data to answer “Is longevity higher or lower?”.

1. Perform a lot of subset operations on cause-of-death. Remove the violent ones, the childbirth ones, the cancer ones that also coincide with old age but not middle-age but maybe middle-old-age should count…, the narcotic ones (but not narcotics that are used for euthanasia in old people), the driving accidents, the young suicides, the wild animal attacks, the malaria, the starvation, the tuberculosis, the ebola, ….
2. Perform just one subset operation on age. Pick some age like 70 over which you will consider all deaths to be “of old age”, even if they got hit by a car. Average all those ages together and the number you’re using now cuts out (roughly, not surgically) a lot of the deaths you aren’t interested in, just as subsetting age at death to `WHERE age > 5` filters out childhood-illness deaths.
3. Consider the upper 10% or 20% or 50% of ages at death. Average those together and now you’re comparing reasonable numbers across countries.

This last one is the `CVaR` approach. Clearly all three have flaws. But the third one needs the least data and the least data janitorship (imagine different languages, different fields/columns, or different coding choices).
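Approaches 2 and 3 only need a list of ages at death, so they are easy to sketch. The two populations below are fabricated toy data, and the function names are mine.

```python
from statistics import mean

def mean_age_over(ages_at_death, cutoff):
    """Approach 2: one subset operation on age, then average."""
    return mean(a for a in ages_at_death if a > cutoff)

def upper_tail_mean(ages_at_death, fraction):
    """Approach 3: average the oldest `fraction` of deaths."""
    ordered = sorted(ages_at_death, reverse=True)
    k = max(1, int(len(ordered) * fraction))  # truncates; keeps at least one
    return mean(ordered[:k])

# Toy populations: B has far fewer early deaths than A.
ages_a = [1, 2, 30, 45, 68, 72, 75, 80, 84, 88]
ages_b = [40, 55, 66, 70, 74, 77, 79, 82, 85, 88]

for name, ages in (("A", ages_a), ("B", ages_b)):
    print(
        name,
        "life expectancy:", mean(ages),
        "| mean over 70:", round(mean_age_over(ages, 70), 1),
        "| oldest 20%:", upper_tail_mean(ages, 0.2),
    )
```

In this toy data the plain averages differ by many years while the upper-tail averages barely move, which is the "rising life expectancy, stagnant longevity" pattern in miniature.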

Just like using lower `CVaR` to compare only poor people’s incomes, if we used upper `CVaR` to compare only old people’s death ages, we’d get better numbers and talk more sense with only a bit more effort.

Years ago, manufacturers could build a sequence of prototypes and use these to discover and rectify any problems. But now competitive pressures [have reduced] the time to bring a vehicle to market…. Automotive manufacturers aim to … design … a new vehicle and the manufacturing facility … in an entirely virtual world.

This speeds the introduction of the new product, but it does mean that designers … aim to anticipate … problems before a physical build of the vehicle is completed or a new production facility is built. Experience [from] the past is useful, but new vehicles have new features…. For these reasons, we need models that predict how humans of different types will behave in vehicle and workplace environments.

I love Julian James Faraway's reasoning process at the beginning of his paper on ergonomic simulation. He starts out by addressing the most important question: why should I care? rather than assuming “STEM is useful” or “Mathematics is good by fiat”.

Instead of saying that some bit of maths is "important" because “important” is an adjective and he felt like putting an adjective there, Dr. Faraway explains why mathematics is relevant to this specific problem which people already care about.

• Because of the production constraints, the automobile manufacturers need to figure this out on computer before building and testing something in reality.
• Because we don’t have infinite money to build a lot of test space programmes, we have to calculate exactly the trajectories and rocket pulse timing beforehand.
• Because the Aswan dam is so hugely expensive, we need to mathematically plan how it should work before making it.

And so on. It suggests that the practical application of mathematics is in areas where prototyping is prohibitively expensive.

Or where prediction is necessary. For example, actuaries predict large-scale (i.e., central limit theorem applies) insurance losses before they happen.