Posts tagged with probability distributions

## ±∞

The Cauchy distribution (`?dcauchy` in `R`) nails a flashlight above the number line, swings it at a constant speed from 9 o’clock down to 6 o’clock over to 3 o’clock (or the other direction, 3→6→9), and counts how much light shone on each number.

In other words we want to map evenly from `the circle (minus the top point)` onto `the line`: two of the most basic, yet topologically distinct, shapes related to one another.

You’ve probably heard of a mapping that does something close enough to this: it’s called `tan`.

Since `tan` is so familiar it’s implemented in Excel, which means you can simulate draws from a Cauchy distribution in a spreadsheet. Make a column of `=RAND()`'s (say column A) and then pipe them through `tan`. For example `B1=TAN(A1)`. You could even do `=TAN(RAND())` as your only column. That’s not quite it, though; you need to stretch and shift the `[0,1]` domain of `=RAND()` so it matches `[−π/2,+π/2]`, the half-angles of the circle’s `[−π,+π]`. So really the long formula (if you didn’t break it into separate columns) would be `=TAN( PI() * (RAND()−.5) )`. A stretch and a shift and you’ve matched the domains up. There’s your Cauchy draw.

In R one could draw three Cauchy’s with `rcauchy(3)` or with `tan(pi*(runif(3)−.5))`.
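A quick sanity check of the recipe: simulate both ways and compare. (The seed and sample size here are arbitrary, just for illustration.)

```r
# Sketch: Cauchy draws via the tan transform vs. R's built-in rcauchy().
set.seed(1)
u <- runif(1e5)               # uniform on [0,1]
x <- tan(pi * (u - 0.5))      # stretch/shift to [-pi/2, pi/2], then tan
y <- rcauchy(1e5)             # built-in draws

# both samples should have medians near 0, and the standard Cauchy's
# upper quartile is tan(pi/4) = 1
median(x); median(y)
quantile(x, .75); quantile(y, .75)
```

The fat tails mean the sample *mean* of either column will bounce around wildly no matter how many draws you take; the quantiles are the stable things to compare.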

What’s happening at `tan(−π/2)` and `tan(π/2)`? The `tan` function is heading off to ±∞.

I saw this in school and didn’t know what to make of it—I don’t think I had any further interest than finishing my problem set.

I saw as well the ±∞ in the output of `flip[x] = 1/x`.

• `1/−.0000...001 → −∞` whereas `1/.0000...0001 → +∞`.

It’s not immediately clear in the `flip[x]` example, but in `tan[x/2]` what’s definitely going on is that the angle is circling past the top of the circle (the hole in the top): the flashlight of the Cauchy distribution could be pointing left or right, parallel to and above the line.

Why not just call this ±∞ the same thing? “Projective infinity”, or, the hole in the top of the circle.

## How Not To Draw a Probability Distribution

If I google for “probability distribution” I find the following extremely bad picture:

It’s bad because it conflates ideas and understates how varied probability distributions can be.

• Most distributions are not unimodal.
• Most distributions are not symmetric.
• Most distributions do not have `mean` = `median` = `mode`.
• Most distributions are not Gaussian, Poisson, binomial, or anything famous at all.

• If this is the example you give to your students of “a distribution”, why in the world would they be surprised at the Central Limit Theorem? The reason it’s interesting is that things that don’t look like the above, sum to look like the above.
• People already mistakenly assume that everything is bell curved. Don’t reinforce the notion!
` `

Here is a better picture to use in exposition. In `R` I defined

`bimodal <- function(x) { 3 * dnorm(x, mean=0, sd=1) + dnorm(x, mean=3, sd=.3) / 4 }`

That’s what you see here, plotted with `plot( bimodal, -3, 5, lwd=3, col="#333333", yaxt="n" )`.

Here’s how I calculated the mean, median, and mode:

• mean is the most familiar: $\text{mean} = \int_{-\infty}^{\infty} x \cdot \mathrm{prob}(x) \ dx$. To calculate this in `R` I defined `bimodal.x <- function(x) { x * 3 * dnorm(x, mean=0, sd=1) + x * dnorm(x, mean=3, sd=.3) / 4 }` and did `integrate(bimodal.x, lower=-Inf, upper=Inf)`.

(You’re supposed to notice that `bimodal.x` is defined exactly the same as `bimodal` above, but multiplied by `x`.)

The output is `.75`; that’s the mean.
• mode is the x where the highest point is. That’s obviously zero. In fancy scary notation one writes “the argument of the highest probability”: $\text{mode} \equiv \arg \max_x \{\ \text{prob}(x)\ \}$
• median is the most useful but also the hardest to define formulaically. The median has 50% of the observations to its left and 50% to its right. So $\int_{-\infty}^{\text{median}} \mathrm{prob}(x) \ dx \ \ =\ \ \int_{\text{median}}^{\infty} \mathrm{prob}(x) \ dx$
In `R` I had to plug lots of values into `integrate( bimodal, lower = -Inf, upper = ... )` and `integrate( bimodal, upper = Inf, lower = ...)` until I got them to be equal. I could have been a little smarter and tried to make the difference equal zero, but the way I did it made sense and was quick enough.

The answer is roughly `.12`.
```
> integrate( bimodal, lower = -Inf, upper = .12 )
1.643275 with absolute error < 1.8e-08
> integrate( bimodal, upper = Inf, lower = .12 )
1.606725 with absolute error < 0.0000027
```

(I could have even found the exact value using a solver. But I felt lazy, please excuse me.)
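For what it’s worth, `uniroot` does the solver job in one call. One wrinkle worth flagging: as defined, `bimodal` integrates to 3.25 rather than 1, so it’s a curve *proportional* to a density; dividing by the total mass would normalise it (the median and mode don’t move, though the mean integral would scale accordingly). A sketch, repeating the definition so it runs on its own:

```r
# bimodal() repeated from above so this chunk is self-contained
bimodal <- function(x) 3 * dnorm(x, mean=0, sd=1) + dnorm(x, mean=3, sd=.3) / 4

total <- integrate(bimodal, -Inf, Inf)$value   # 3.25 -- not a normalised density

# the median puts half the total mass on each side; uniroot() finds the
# crossing point of (mass to the left) - (half the mass)
halfmass <- function(m) integrate(bimodal, -Inf, m)$value - total / 2
uniroot(halfmass, interval = c(-1, 1))$root    # roughly .1
```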

Notice that I drew the numbers as vertical lines rather than points on the curve. And I eliminated the vertical axis labels. That’s because the mean, median, and mode are all x values and have nothing whatever to do with the vertical value. If I could have figured out how to draw a coloured dot at the bottom, I would have. You could also argue that I should have shown more humps or made the mean and median diverge even more.

Here’s how I drew the above:

```
png("some bimodal dist.png")
leg.text <- c("mean", "median", "mode")
leg.col <- c("red", "purple", "turquoise")
par(lwd=3, col="#333333")
plot( bimodal, -5, 5, main = "Some distribution", yaxt="n" )
abline(v = 0, col = "turquoise")
abline(v = .12, col = "purple")
abline(v = .75, col = "red")
legend(x = "topright", legend = leg.text, fill = leg.col, border="white", bty="n", cex = 2, text.col = "#666666")
dev.off()
```

Lastly, it’s not that hard in the computer era to get an actual distribution drawn from facts. The `nlme` package has actually recorded heights of boys from Oxford:

```
require(nlme); data(Oxboys);
plot( density( Oxboys$height ), main = "height of boys from Oxford", yaxt="n", lwd=3, col="#333333")
```

and boom:

or in histogram form with `ggplot`, run `require(ggplot2); qplot( data = Oxboys, x = height )` and get:

The heights look Gaussian-ish, without mistakenly giving students the impression that real-world data follows perfect bell-shaped patterns.

## Sub i, sub j

I hope I can say this in a way that makes sense.

One kind of mathematical symbology your eyes eventually get used to is the “Σum over all individuals” concept:

$\begin{matrix} \displaystyle \sum_{i=0}^N \ x_i \\ \\ \displaystyle \sum_i \ x_i \\ \\ \displaystyle \sum_{\text{interesting set}} x_i \\ \\ \displaystyle \sum_i \ \mathrm{weight}_i \cdot x_i \\ \\ \displaystyle \sum_i \sum_j x_{i,j} \\ \\ \displaystyle \sqrt{ {1 \over n} \sum_i (x_i - \bar{x})^2} \end{matrix}$

Yes, at first it’s painful, but eventually it looks no more confusing than the oddly-bent Phoenician-alphabet letters that make up English text.
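For readers more comfortable in `R` than in Σ's, here are the same patterns on a made-up vector (the numbers are invented, purely for illustration):

```r
x <- c(2, 5, 7, 1, 9)          # five 'individuals', x_1 ... x_5
w <- c(.1, .2, .3, .2, .2)     # weights summing to 1

sum(x)                         # sum_i x_i
sum(w * x)                     # sum_i weight_i * x_i
sum(x[x > 4])                  # sum over an 'interesting set'
sqrt(mean((x - mean(x))^2))    # root of the mean squared deviation
```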

I believe there is a generally worthwhile pearl of thought-magic wrapped up in the sub-i, sub-j pattern. I mean, a metaphor that’s good for non-mathematicians to introduce into their mind-swamps.

That pearl is a certain connection between the specific and the general—a way of reaching valid generality, without dropping the specificity.

` `

Before I explain it, let me talk a bit more about the formalism and how it’s used. I’ll introduce one word: functional. A functional ƒ maps from a small, large, convoluted, or simple domain onto a one-dimensional codomain (range).

Examples (and non-examples) of functionals:

• size — you can measure the volume of a convex hull, the length of an N-dimensional vector, the magnitude of a complex number, the girth of a rod, the supremum of a functional, the sum of a sequence, the length of a sequence, the number of books someone has read, the breadth of books someone has read (is that one-dimensional? maybe not), the complicatedness (Vapnik-Chervonenkis dimension) of a functional, the Gini coefficient of a country’s income distribution, the GNP of a country, the personal incomes of the lowest earning 10% of a country, the placement rate of an MBA programme, the mean post-MBA income differential, the circumference of a ball, the volume of a ball, … and many other kinds of size.
• goodness / score — business metrics often rank super-high-dimensional things, like the behaviour of a group of team members, into a total ordering from desirable to less desirable. When businesses use several different metrics (scores), that’s not a functional but rather a concatenation of several functionals (into a function).
• utility — for homo economicus, all possible choices are totally, linearly ordered by equivalence classes of isoclines.
• fitness — all evolutionary traits (a huge, huge space) are cross-producted with an evolutionary environment to give a Fitness Within That Environment: a single score.
• angle — if “angle” has meaning (if the space is an inner product space) then angle is a one-dimensional codomain. In the abstract sense of “angle” I could be talking about correlation or … something else that doesn’t seem like a geometrical angle as normally described.
• distance … or difference — Intimately related to size, distance is kind of like “size between two things”. If that makes sense.
• quantum numbers — four quantum numbers define an electron. Each number (n, l, m, spin) maps to a one-dimensional answer from a finite corpus. Some of the corpora are interrelated though, so maybe it’s not really 1-D.
• quantum operators — Actually, some quantum operators are non-examples because they return an element of Hilbert space as the answer. (like the Identity operator). But for example the Energy operator returns a unidimensional value.
• ethics — Do I need more non-examples of functionals? A complete ethical theory might return a totally rankable value for any action+context input. But I think it’s more realistic to expect an ethical theory to return a complicated return-value type since ethics hasn’t been completely figured out.
• regression analysis — You get several β's as return values, each mogrified by a t-value. So: not a one-dimensional return type.
• logic — in the propositional calculus, declarative sentences return a value from {true, false} or from {true, false, n/a, don’t know yet}. You could argue about whether the latter is one-dimensional. But in modal logic you might return a value from the codomain “list of possible worlds in which proposition is true”, which would definitely not be a 1-dimensional return type.
• factor a number — last non-example of a functional. You put in 136 and you get back {1, 2, 4, 8, 17, 34, 68, 136}. Which is 8 numbers rather than 1. (And potentially more: 1239872 has fourteen divisors or seven prime factors, whichever you want to count.)
• median — There’s no simple formula for it, but the potential answers come from a codomain of just-one-number, i.e. one parameter, i.e. one dimension.
• other descriptive statistics — interquartile range, largest member of the set (`max`), 72nd percentile, trimean, 5%-winsorised mean, … and so on, are 1-dimensional answers.
• integrals — Integrals don’t always evaluate to unidimensional answers, but they frequently do. “Area under a curve” has a unidimensional answer, even though the curve is infinite-dimensional. In statistics one uses marginalising integrals, which reduce the dimensionality by one. But you also see a single ∫ that represents a sequence of ∫∫∫'s reducing to a size-type answer.
• variability — Although wiggles are by no means linear, variance (2nd moment of a distribution) measures a certain kind of wiggliness in a linearly ordered, unidimensional way.
• autocorrelation — Another form of wiggliness, also characterised by just one number.
• Conditional Value-at-Risk — This formula $\int_{0\%}^{10\%} \mathrm{something} \cdot d \, \mathrm{something}$ is a so-called “coherent risk measure”. It’s like the expected value of the lowest decile. Also known as expected tail loss. It’s used in financial mathematics and, like most integrals, it maps to one dimension (expected £ loss).
• "the" temperature — Since air is made up of particles, and heat is to do with the motions of those particles, there are really something like 10^23 dynamical orbits that make a room warm or cold (not counting the sun’s rays). “The” temperature is some functional of those—like an average, but exactly what I don’t know.
` `

Functionals can potentially take a bunch of complicated stuff and say one concrete thing about it. For example I could take all the incomes of all the people in Manhattan, apply this functional:

$\displaystyle {1 \over N} \ \sum_{j \, \in \, \text{Manhattan}} \ \text{income}_j$

and get the average income of Manhattan.

Obviously there is a huge amount of individual variation among Manhattan’s residents. However, by applying a functional I can get Just One Answer about which we can share a discussion. Complexity = reduced. Not eliminated, but collapsed.

I could apply other functionals to the population, like

• count the number of trust fund babies (if “trust fund baby” can be defined)
• calculate the fraction of artists (if “artist" can be defined)
• calculate the “upper tail risk” (ETL integral from 90% to 100%, an average which would include Nueva York’s several billionaires)

Each answer I am getting, despite the wide variation, is a simple, one-dimensional answer. That’s the point of a functional. You don’t have to forget the profundity or specificity of individual or group variation, but you can collapse all the data onto a single, manageable scale (for a time).
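To make that concrete, a hypothetical sketch (the incomes are invented, not real Manhattan data): one vector of individuals, several functionals, each collapsing the whole thing to one number.

```r
# invented incomes for illustration; each line is one functional,
# collapsing the entire vector to a single one-dimensional answer
incomes <- c(28e3, 45e3, 52e3, 61e3, 75e3, 90e3, 2.1e6)

mean(incomes)        # the averaging functional -- dragged up by the outlier
median(incomes)      # a different functional, a different one-number answer
max(incomes)         # yet another
mean(incomes[incomes >= quantile(incomes, .9)])   # crude upper-tail average
```

Same data every time; only the collapsing rule changes, and each rule tells a different one-dimensional story.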

` `

**The payoff**

The sub-i sub-j pattern allows you to think about something both specifically and in general, at once.

1. Each individual is counted uniquely. The description of each individual (in terms of the parameter/s) is unique.
2. Yet there is a well-defined, actual generalisation to be made as well. (Or multiple generalisations if the codomain is multi-dimensional.) These are valid generalisations. If you combine together many such generalisations (median, 95th percentile, 5th percentile, interquartile range) then you can quickly get a decent description of the whole.

Kind of like how thinking with probability distributions can help you avoid stereotypes: you can understand the distinctions between

• the mean 100m sprint time of all men is faster than the mean 100m sprint time of all women
• the medians are rather close, perhaps identical
• the top 10% of women run faster than the bottom 80% of men
• the variance of male sprint times is greater than the variance of female sprint times
• differences in higher moments, should they exist
• the CVaR's of the distributions are probably equivalent
• conditional distributions (sub-divisions of sprint times) measured of old men; age 30-42 black women; age 35 Caribbean-born women of any race of non-US nationality who live in the state of Alabama
• and so on.

It becomes harder to sustain sexism, racism, and stereotypes of all sorts. It becomes harder to entertain generalistic, simplistic, model-driven, data-less economic thinking.

• For instance, the unemployment rate is the collapse/sum of ∀ lengths of individual unemployment spells: ∫ (`length of unemp spell`) • (`# of people w/ that spell length`) `d`(length), i.e. ∫ x • ƒ(x) `d`x.

Like the dynamic vapor pressure of a warm liquid in a closed container, where different molecules are pushing around in the gas and alternately returning to the soup. The total pressure looks like a constant, but that doesn’t mean the same molecules are gaseous—nor does it mean the same people are unemployed.

(So, for example, knowing that the unemployment rate is higher doesn’t tell you whether there are a few more long-term unemployed people, a lot more short-term unemployed people, or a mix.)
• You can generalise about a group using different functionals. The average wealth (`mean` functional) of an African-American Estadounidense is lower than the average wealth of a German-American Estadounidense, but that doesn’t mean there aren’t wealthy AA’s (`max` functional) or poor GA’s (`min` functional).
• You don’t have to collapse all the data into just one statistic.
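The unemployment-spell point above can be sketched with toy numbers (all hypothetical): the same aggregate can come from many short spells or a few long ones.

```r
# two toy 'economies': same total person-months unemployed,
# very different individual experiences (numbers are made up)
spells.a <- rep(2, 50)    # 50 people unemployed 2 months each
spells.b <- rep(20, 5)    # 5 people unemployed 20 months each

sum(spells.a)             # 100 person-months
sum(spells.b)             # 100 person-months: same collapse, different story
```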

You can also collapse the data into groups, for example collapsing workers into groups based on their industry.

(here the vertical axis = number of Estadounidenses employed in a particular industry — so the collapse is done differently at each time point)

Various facts about Venn Diagrams, calculus, and measure theory constrain the possible logic of these situations. It becomes tempting to start talking about underlying models, variation along a dimension, and “the real causes” of things. Which is fun.

At the same time, it becomes harder to conceive overly simplistic statements like “Kentuckians are poorer than New Yorkers”. Which Kentuckians do you mean? And which New Yorkers? Are you saying the median Kentuckian is poorer than the median New Yorker? Or perhaps that the dollar cutoff for the bottom 70% of Kentuckians is lower than the cutoff for the bottom 50% of New Yorkers? I’m sorry, but there’s too much variation among KY’s and NY’s for the statement to make sense without a more specific functional mapping from the two domains of the people in the states onto a dollar figure.

` `

ADDED: This still isn’t clear enough. A friend read this piece and gave me some helpful feedback. I think maybe what I need to do is explain what the sub-i, sub-j pattern protects against. It protects against making stupid generalisations.

To be clear: in mathematics, a generalisation is good. A general result applies very broadly and, like the more specific cases, it’s true. Since I talk about both mathematical speech and regular speech here, this might be confusing. But: in mathematics, a generalisation is just as true as the original idea; it merely applies in more cases. Hence it is more likely to apply to real life, more likely to connect to other ideas within mathematics, and so on. But as everyone knows, people who “make generalisations” in regular speech are usually getting it wrong.

Here are some stupid generalisations I’ve found on the Web.

• Newt Gingrich: "College students are lazy."
Is that so? I bet that only some college students are lazy.

Maybe you could say something true like “The total number of hours studied divided by total number of students (a functional ℝ⁺^{# students}→ℝ⁺) is lower than it was a generation ago.” That’s true. But look at the quantiles, man! Are there still the same number of studious kids but only more slackers have enrolled? Or do 95% of kids study less? Is it at certain schools? Because I think U Chicago kids are still tearing their hair out and banging their heads against the wall.
• Do heterodox economists straw-man mainstream economics?
I’m sure there are some who do and some who don’t.
• The bad economy is keeping me unemployed.
That’s foul reasoning. A high general unemployment rate says nothing directly about your sector or your personal skills. It’s a spatial average. Anyway, you should look at the length of personal unemployment spells for people in your sector and with your skills.
• Conservatives say X. Liberals say Y. Libertarians think Z.
Probably not ∀ conservatives say X. Nor ∀ liberals say Y. Nor do ∀ libertarians think Z. Do 70% of liberals say Y? Now that I’m asking you to put numbers to the question, that should make you think about defining who is a liberal and measuring what they say. Not only listening to the other side, but quantifying what they say. Are you so sure that 99% of libertarians think Z now?
• The United States needs to focus on creating high-tech jobs.
Are you actually just talking about opportunities for upper-middle-class people in Travis County, TX and Marin County, CA? Or does your idea really apply to Tuscaloosa, Flint, Plano, Des Moines, Bemidji, Twin Falls, Lawrence, Tempe, Provo, Cleveland, Shreveport, and Jacksonville?
• Green jobs are the future!
For whom?
• Alaskans are enslaved to oil companies.
• Meat eaters, environmentalists, blacks, hipsters, … you can find something negative said about almost any group.
Without quantification or specificity, it will almost always be false. With quantification, one must become aware of the atoms that make up a whole—that the unique atoms may clump into natural subgroups; that variation may derive from other associations—that the true story of a group is always richer and more interesting than the imagined stereotypes and mental shorthand.
• What’s wrong with the teenage mind? WSJ.
a teenage mind?
• French women eat rich food without getting fat. Book.
• French parents are better than American parents. WSJ.
• What is it about twenty-somethings? NY Times.

If you sub-i, sub-j these statements, you can come up with a more accurate and productive sentence that could move disagreeing parties forward in a conversation.

Unwarranted generalisations are like Star Trek: portraying an entire race as being defined by exactly one personality trait (“Klingons are warlike”, “Ferengi only care about money”). That sucks. The sub-i, sub-j way is more like Jack Kerouac’s On the Road: observing and experiencing individuals for who they are. That’s the way.

If you want to make true generalisations—well, you’re totally allowed to use a functional. That means the generalisations you make are valid—limited, not overbearing, not reading too much into things, not railroading individuals who contradict your idea in service of your all-important thesis.

OK, maybe I’ve found it: a good explanation of what I’m trying to say. There are valid ways to generalise about groups and there are invalid ways. Invalid is making sweeping over-generalisations that aren’t true. Sub-i, sub-j generalisations are true to the subject while still moving beyond “Everyone is different”.

## Irrationality in Economics, and “Subjective Probability”

I gave this talk several years ago, but you know what? It’s still pretty decent.

The title is misleading. Like many of my titles, it’s meant to grab attention rather than be exactly correct.

I was trying, with this talk, to convince college freshmen to switch from Philosophy to Economics. And you know, Philosophers are always talking about Rationality — is there even such a thing, and if so what does it consist of? Econ provides more than one concrete prescription for Rationality — more on that below.

` `

“We are recorders and reporters of the facts—not judges of the behavior we describe.” —Alfred Kinsey

I actually think that economists and psychologists could do more to prescribe healthy, effective behaviours and thought-strategies for people to follow. But the recommendations should be based on empirics, e.g.

• "buy experiential goods, not durable goods";
• "purchase with cash instead of plastic";
• "beware these 4 common investing mistakes made by novices";
• "put crisps and fudge in a drawer, not in plain sight"

—not on a general model of “optimal” behaviour.

Theorists, though, don’t have the necessary understanding to make normative evaluations. Not yet, at least. But they can approach the deep Utility Theory questions in the spirit of the above quotation. They can model behaviours and thoughts, and inquire as to how they are internally structured — without the prejudice of inherited mathematical aesthetics.

What do I mean by ‘inherited aesthetics’? One example is substituting the mathematics of probability for a separate theory of human figuring.

` `

I SHOULD HAVE SAID IT LIKE THIS IN THE SLIDES

One parsimonious shortcut economists tried, which didn’t work out, was to use probability mathematics to explain how people think about the future. If we can conceive of people’s beliefs as mathematical probabilities, then regular microeconomics + more maths = a new, better theory of behaviour.

For example, curved preferences over wealth would manifest themselves in probabilistic situations such as lotteries, insurance, betting, investing, employment in risky jobs, and love & sex risks.

But. People don’t think that way. They don’t make accurate calculations about Poisson distributions, Beta distributions, Bayesian priors, Aumann agreement theorems, and so on. I guess evolution either built us for something different or else we’re just misshapen clay with limited resources to Bayes our way to rationality.

I speculate that the way people think about probability — dubbed “subjective probability” by Leonard Savage — is shaped very differently from what mathematicians usually consider “natural” axioms — transitivity, commutativity, reflexivity, independence of irrelevant alternatives, monotonicity, and so on. But who knows? The correct theory doesn’t exist yet.

` `

NOT ACTUALLY IRRATIONAL

The word “irrationality” I definitely abused.

Economists come up with a theory of how people behave and say it’s “ideal” or “rational”. People don’t actually think like that, so then we say they’re “irrational”? That doesn’t make sense. The theory was just wrong; an incorrect description. They perform sub-optimally according to some guy’s theory of the world, of their value system, and of how they should think. But since we don’t really know how people really think, how they experience the results of their choices, or how we should evaluate discrepant self-reports of how good a decision was, we can’t say what’s rational.

So although it took the Ellsberg Paradox, the Allais Paradox, and other results to disprove the accepted theory that naïvely united Probability and Utility, those results are not the point. The point is that we have to conceive a more realistic model of people’s mental models before Economics can draw valid conclusions about what people “should” do.