Posts tagged with data

It takes ~20 observations to verify your first significant digit of the mean with confidence.

Do you know how many observations it takes to verify your first sig-fig of the variance? More like 1000. And that’s just to get one digit of accuracy! Higher moments (skew, kurtosis) are even worse.

That’s why I often laugh out loud when I read in the newspaper claims that rely on a certain value of the variance. Even in serious, published papers!—I often see tables with estimates of standard deviation that go out to three decimal places, just because the software spat the numbers out that way. It gives a false sense of accuracy. It’s ridiculous.
Karen Kafadar
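A rough simulation makes the gap concrete. This is only a sketch under assumptions that are not in the quote: normally distributed data with mean 10 and standard deviation 3 (both made up), and "spread of the estimate relative to the true value" standing in for "verifying a digit". The helper name rel.spread is hypothetical. For normal data, theory says the standard error of the sample mean is σ/√n and that of the sample variance is σ²·√(2/(n−1)).

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# Assumed toy population (not from the quote): normal, mean 10, sd 3
true.mean <- 10
true.sd   <- 3

# Relative spread of an estimator over many repeated samples of size n:
# standard deviation of the estimates divided by the true value.
rel.spread <- function(estimator, truth, n, reps = 2000) {
  estimates <- replicate(reps, estimator(rnorm(n, true.mean, true.sd)))
  sd(estimates) / abs(truth)
}

sizes <- c(20, 100, 1000)
result <- data.frame(
  n        = sizes,
  mean.rel = sapply(sizes, function(n) rel.spread(mean, true.mean, n)),
  var.rel  = sapply(sizes, function(n) rel.spread(var,  true.sd^2, n))
)
print(result)
```

With these made-up parameters the formulas put the mean's relative spread at n=20 at roughly 7%, and the variance only gets down to roughly 4.5% at n=1000. The figure for the mean depends on the sd-to-mean ratio I picked; the figure for the variance does not.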




Since people liked my last opinion piece on #big data, here’s another one.

Imagine there was a technology that allowed me to record the position of every atom in a small room, thereby generating some ridiculous amount of data (Avogadro’s number is 𝒪(10²³), so some prefix around that order of magnitude, eg yottabytes). And also imagine that there was a way for other scientists to decode and view all of that. (Maybe the latency and bandwidth can still be restricted even though neither capacity nor resolution nor fidelity nor coverage of the measurement is restricted; although that won’t be relevant to my thought experiment, it would seem “like today” where MapReduce is required.)

Let’s say I am running some behavioural economics experiment, because I like those. What fraction of the data am I going to make use of in building my model? I submit that the psychometric model might be exactly the same size as it is today. If I’m interested in decision theory then I’m going to be looking to verify/falsify some high-level hypothesis like “Expected utility” or “Hebbian learning”. The evidence for/against that idea is going to be so far above the atomic level, so far above the neuron level, I will basically still be looking at what I look at now:

  • Did the decisions they ended up making (measured by maybe 𝒪(100), maybe even 𝒪(1) numbers in a table) correspond to the theory?
  • For example if I draw out their assessment of the probability and some utility ranking then did I get them to violate that?

If I’ve recorded every atom in the room, then with some work I can get up to a coarser resolution and make myself an MRI. (Imagine working with tick-level stock data when you really are only interested in monthly price movements, but in 3-D.) (I guess I wrote myself into even more of a corner here: if we have atomic-level data then it’s quantum, meaning you really have to do some work to get it to the fMRI scale!) But say I’ve gotten to fMRI-level data; then what am I going to do with it? I don’t know how brains work. I could look up some theories of what lighting-up in different areas of the brain means (and what about 16-way dynamical correlations of messages passing between brain areas? I don’t think anatomy books have gotten there yet). So I would have all this fMRI data and basically not know what to do with it. I could start my next research project to look at numerically / mathematically obvious properties of this dataset, but that doesn’t seem like it would yield up a Master Answer of the Experiment, because there’s no interplay between theories of the brain and trying different experiments to test them out. I’m just looking at “one single cross section”, which is my one behavioural econ experiment. Might squeeze some juice but who knows.

http://www.michaeleisen.org/blog/wp-content/uploads/2008/10/wwjp_final_bwgoldenrod.png

Then let’s talk about people critiquing my research paper. I would post all the atomic-level data online of course, because that’s what Jesus would do. But would the people arguing against my paper be able to use that granular data effectively?

I don’t really think so. I think they would look at the very high level of 𝒪(100) or 𝒪(1) data that I mentioned before, where I would be looking.

  • They might argue about my interpretation of the numbers or statistical methods.
  • They might say that what I count as evidence doesn’t really count as evidence because my reasoning was bad.
  • They couldn’t argue that the experiment isn’t replicable because I imagined a perfect-fidelity machine here.
  • They could go one or two levels deeper and find that my experimental setup was imperfect—the administrator of the questions didn’t speak the questions in exactly the same tone of voice each time; her face was at a slightly different angle; she wore a different coloured shirt on the other day. But in my imaginary world with perfect instruments, those kinds of errors would be so easy to see everywhere that nobody would take such a criticism seriously. (And of course because I am the author of this fantasy, there actually aren’t significant implementation errors in the experiment.)

Now think about either the scientists 100 years after that or if we had such perfect-fidelity recordings of some famous historical experiment. Let’s say it’s Michelson & Morley. Then it would be interesting to just watch the video from all angles (full resolution still not necessary) and learn a bit about the characters we’ve talked so much about.

But even here I don’t think what you would do is run an exploratory algorithm on the atomic level and see what it finds — even if you had a bajillion processing power so it didn’t take so long. There’s just way too much to throw away. If you had a perfect-fidelity-10²⁵-zoom-full-capacity replica of something worth observing, that resolution and fidelity would be useful to make sure you have the one key thing worth observing, not because you want to look at everything and “do an algo” to find what’s going on. Imagine you have a videotape of a murder scene, the benefit is that you’ve recorded every angle and every second, and then you zoom in on the murder weapon or the grisly act being committed or the face of the person or the tiny piece of hair they left and that one little sliver of the data space is what counts.

What would you do with infinite data? I submit that, for analysis, you’d throw most of the 10²⁵ bytes away.




Gauging the frothiness of the webby/techy/san-fran VC market.

Source: Mark Suster. Propagated via one of tumblr’s owners, who added:

Based on the NVCA statistics on the venture capital industry, there are [approximately] 1,000 early stage financings every year….

And somewhere around 50 - 100 of them exit for more than $100mm every year. So 5-10% of the companies financed by VCs end up exiting for more than $100mm.

Mathematical PS: These are value-at-risk numbers, just upside-down.






Big Data vs Quality Data

  • theLoneFuturist: I'm not certain why learning Hadoop isn't more attractive to you. If you are fine with R, doesn't having lots of data interest you?
  • theLoneFuturist: Don't get me wrong, there are probably unexciting tasks associated with big data, but you'd then get to run your algorithms over big data. And lack of data is an often cited problem for learning/adaptive algorithms. But of no interest to you?
  • isomorphisms: The BIG DATA fad seems to be based on "let's turn a generic algorithm loose on exabytes!"
  • isomorphisms: No matter how the data was gathered, what its underlying shape/logic is, what's left out.
  • isomorphisms: For example twitter text analysis. At a high level I might ask "How are attitudes changing?" "How do people talk about women differently than men?" "Do attitudes toward Barack Obama depend on the state of the US economy?" Questions whose answers aren't easy to turn into just a few numbers.
  • isomorphisms: My parody of a big-data faddist's response would be all the sophistication of: listen twitter | Hadoop_grep Obama | uniq -c | well_known_sentiment_analysis_algo. Hooray! Now I know how people feel about Obama! /sarcasm
  • isomorphisms: In the 'modelling vs scavenging' war (cf Leo Breiman) I'm more on the modelling side. So I find some aspects of the ML / bigdata craze unsavoury.
  • isomorphisms: But the emergence of petareams is certainly a paradigm shift. I don't think the Big Data faddists are wrong in that. That environmental difference will change things as surely as cheap computing power changed statistics. (Why learn statistical theory when you can bootstrap?) As far as the art of the possible -- more clickstreams being recorded makes more analysis doable.
  • isomorphisms: Anyway, to answer your question, no, having a lot of data doesn't interest me.
  • isomorphisms: I'd rather have interesting data than lots of it.
  • theLoneFuturist: Thing is, interesting data is probably a subset of big data. Mechanically define/separate interesting and you can get it.
  • isomorphisms: Definitely not, think about historical data.
  • isomorphisms: For example Angus Maddison's estimates of ancient incomes; the archaeological or geological record; unscanned text (like the Book of Kells, are you going to OCR an illuminated manuscript? You would miss the Celtic knots)
  • isomorphisms: Even if stuff were OCR scanned properly and no problems with tables, the interpretive work that historians do would be hard to code up in an algorithm. To me they dig up much more interesting information than the petabytes of clickstream logs.
  • isomorphisms: Or these internal documents they just found from Al-Qaeda? Which would you rather have, 100 GB of server logs or 10 kB worth of text from Osama bin Laden at a crucial moment?
  • isomorphisms: Also, we talk about text being "unstructured data", how about "I smell sulphur coming from over there" (during an archaeological dig) or "This kind of quartz shouldn't be at this depth in this part of the world" or, you know, "Hey look are those dinosaur footprints?"
  • isomorphisms: The kind of stuff a fisherman might notice. THAT'S unstructured data.
  • theLoneFuturist: Sure, though if enough historical records get scanned, they too become the dread big data. I do catch your point, though.




An unabashedly narcissistic data analysis of my own tweets.

The unequivocally lovely Jeff Gentry (@geoffjentry) has contributed an R package with easy-to-read documentation that works, which I’ll walk through here so that you, too, can gaze at your own face mirrored in the beauty of a woodland pond—er, sea of electrons.

Here’s the basic flow for grabbing stuff. You can do more with ROAuth but that’s a bit of a pain.

require(twitteR)
RT.of.me <- searchTwitter("RT @isomorphisms", n=100)
news <- getTrends(n=50)
firehose <- publicTimeline(n=999)


my.tweets <- userTimeline('isomorphisms', n=3500)
# userTimeline() returns a list of status objects, so convert before pulling out the text
head( twListToDF(my.tweets)$text )
Consider: Donkey Kong is neither a donkey, nor a kong.
William Thurston, geometrizer of manifolds http://t.co/UPwuAnbP
When I invent a single-letter language, it's going to be called Ж.
@theLoneFuturist True. If $GOOG were only an ad network, with no search facility, how much would it be worth?
Do your arms hang down by your side in zero gravity? Because then I bet astronauts have less smelly armpits.
Can't log into Hacker News with #w3m! Unexpected.
Salt and sugar are opposites. Therefore if i eat too much salty food I must balance it with candy. #logic
@leighblue Do you know any behavioural econ studies on utility vs bite-size / package-size?

Those are some of the ways you can grab data — twitteR hooks into RCurl and then, like, the info is just there. Run twListToDF( tweets ) to split the raw info into 10 subfields—text-of-tweet, to-whom-was-the-reply-threaded, timestamp, and more.

To pull out just one of those fields—like “source of tweet”, for example, use sapply:

my.tweets <- userTimeline('isomorphisms', n=3000)
whence.i.tweet <- sapply( my.tweets, function(x) x$statusSource )

You can see from plots 1, 2, 3, and 4 that I use @floodgap’s TTYtter client (tweeting from the command line; no installation). In fact this is why I’ve started tweeting so much the last few months: I run TTYtter in a virtual terminal, mutt (command-line gmail) in another virtual terminal, and therefore it becomes quite easy to flick my virtual newsfeed/conversation stream on for a minute or two here and there whenever I’m at the computer. It feels like The Matrix or Neuromancer or something.

Here’s how I created the ggplot radial chart #4 — this was the longest command I had to use to generate any of them. For some reason qplot didn’t like scale_y_log10() so I did:

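require(ggplot2)
# each + must end a line; a line that starts with + is parsed as a separate statement
ggplot( data = data.frame(whence.i.tweet),  aes( x=factor(whence.i.tweet),  fill=factor(whence.i.tweet) )   ) +
 scale_y_log10() +
 geom_bar() +
 coord_polar() +
 opts(  title="whence @isomorphisms tweets",   axis.title.x=theme_blank(),   legend.title=theme_blank()   )
# opts() and theme_blank() belong to older ggplot2; newer releases use labs()/theme() and element_blank()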

In the words of @jeffreybreen, twitteR almost makes this too easy. A few months ago, before I knew about this package, I was analysing tweets for a client who wanted to gauge the effectiveness of “customer service tweets”. I wrote an ugly, hacky perl script that told me whether the tweet had an @ in it, whether the @ was a RT @, whom the tweet was @, and so on. Dealing with people using @ in another sense besides “Hey @cmastication, what’s up?”; different numbers of spaces between RT/MT; multiple RT’s in the same message; and so on, was an icky mess. I probably spent half a week changing my regexes around to deal with more cases I hadn’t thought of. Like most statisticians, I hate data munging: swimming around in the data is the fun part, not patching up the kiddie pool. Besides that, my client wanted the results in an Excel file, and Excel can’t handle multidimensional arrays (whereas a tweet mentioning @a @b @c should have just one “mentions” slot with three things in it).
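For a flavour of that munging, here is a minimal base-R sketch of the "one mentions slot with several things in it" idea. It is not the perl script described above. The sample tweets are invented, and the pattern is my own simplifying assumption of what counts as a handle (an @ followed directly by letters, digits or underscores), so it would also wrongly pick up things like the "@bar" in an e-mail address.

```r
# Invented sample tweets, for illustration only
tweets <- c(
  "RT @cmastication: what's up?",
  "Hey @a @b @c, lunch?",
  "Meet me @ the pond",
  "no mentions here"
)

# Assumed handle pattern: "@" immediately followed by word characters
mention.pattern <- "@[A-Za-z0-9_]+"

# A list with one element per tweet; each element holds zero or more handles
mentions <- regmatches(tweets, gregexpr(mention.pattern, tweets))

# Flag retweets / modified tweets, allowing any number of spaces after RT or MT
is.retweet <- grepl("^(RT|MT)[[:space:]]*@", tweets)

n.mentions <- lengths(mentions)
print(data.frame(tweet = tweets, is.retweet = is.retweet, n.mentions = n.mentions))
```

A list keeps the three handles of the second tweet together in one slot, which is exactly what a flat spreadsheet row can't do. The bare "@" in the third tweet is ignored because nothing handle-like follows it.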


That twitteR package is so hot right now.

  

But as much fun as it was to display my love of TTYtter in four different plots, that’s not the only R-based egotainment you can compute on a Friday night.

How wordy am I?

I know I am wordy. I often adopt a telegraphic SMS-like typing style (“Sntrm wd b gr8 prez, like Ahmedinejad”) rather than hold back my trenchant remarks about astronauts’ armpits. Tumblr’s auto-tweets don’t help my average, either—the default is long, and I’m usually too lazy to change it.

With the magic of kernel density estimates—which are definitely not overkill for the analysis of my appropriately-florid and highly-important charstreams—and my usual base::plot params, the length of my tweets is made art in the form of chart #5.

I got a vector of tweet-lengths using @hadleywickham’s stringr package:

my.tweets <- userTimeline('isomorphisms', n=3500)
my.tweets <- twListToDF( my.tweets )
iso <- my.tweets$text
require(stringr)
iso.len <- str_length(iso)   #vectorised! No for loops necessary
hist( iso.len, col="cyan" )   # base hist() takes col, not fill

Proving once again that all real-world distributions fit a bell curv—…um.

You can of course use subset( my.tweets ) to plot tweets that were made under certain conditions—I might look only at my tumblr auto-posts using subset( my.tweets, statusSource=="tumblr"). Or only at short tweets using subset( my.tweets, str_length(my.tweets$text)<100 ). And so on.

 

Lastly, I wanted to plot my tweeple—the people I talk to on twitter (most of whom I don’t actually know in real life … I like to keep friends and mathematical geekery separate). As you can see from the final chart, it was largely a sh_tshow. Or so I thought, until I considered attacking the problem with ggplot.

One of ggplot’s strengths (in my opinion its greatest strength) is the facet_grid( attribute.1 ~ attribute.2 ) function. In combination with base::cut, which assigns discrete “levels” to the data, facetting is especially powerful. I cut my data into four subsets, based on how many times I’ve tweeted @ someone:

my.tweets <- userTimeline( 'isomorphisms', n=3000 )
my.tweets <- twListToDF( my.tweets )   # subset() below needs a data frame, not a list of statuses

# only tweets that are @ someone
talkback <- subset( my.tweets,  is.na(replyToSN) == FALSE )
#the value would be NA iff I tweeted into the vast nothingness, apropos of no-one

# just the names, not the rest of the tweet's text or meta-information
tweeps <- talkback$replyToSN

#make a new data frame for ggplot to facet_wrap.
tweep.count <- table(tweeps)
tweep.levels <- cbind( tweep.count,
  cut( tweep.count, c(0,1,2,5,100) ),
  rownames(tweep.count)
)
tweeps <- data.frame(tweep.levels)
names(tweeps) <- c("number", "category", "name")
class(tweeps$number) <- "numeric"
#all the above stuff only came clear after a few attempts
#and likewise the plot didn't work out perfect at first, either!

#but here's a decent plot that works:
ggplot( data = tweeps, aes(x=number) ) +
  facet_wrap(~ category, scales="free_x") +
  geom_text( aes(label=name, y=30-order(name), size=sqrt(log(number)),    col=number+(as.numeric(category))^2 ), position="jitter" ) +
  opts( legend.title = theme_blank(), legend.text = theme_blank() )

This made for a much more readable image. Not perfect, but definitely displaying info now.
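If base::cut is new to you, here is a tiny self-contained sketch of what it does with the same break points as above. The names and counts are invented. Building the data frame directly, rather than through cbind, is my own variation; it keeps the counts numeric instead of coercing everything to text.

```r
# Invented reply counts per person, for illustration only
counts <- c(alice = 1, bob = 2, carol = 4, dave = 30)

# Same break points as in the code above: (0,1], (1,2], (2,5], (5,100]
bins <- cut(counts, breaks = c(0, 1, 2, 5, 100))

toy.tweeps <- data.frame(
  name     = names(counts),
  number   = as.numeric(counts),
  category = bins
)
print(toy.tweeps)
print(table(toy.tweeps$category))
```

Each interval is open on the left and closed on the right, so someone I've tweeted at exactly twice lands in (1,2] and not in (2,5].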

 

OK, I do love talking about my twistory a little too much — but I’d like to see your histograms as well! If you run some stats on your own account, please post some pics below. I believe images can be directly embedded in the Disqus comments with <img src="http://i.minus.com/staggering_analysis_of_my_fantastic_words.jpg">.

(To save your R plots to a file rather than to the screen, do png("a plot named Sue.png"); plot( laa dee daa ); dev.off() where ; could be replaced by a newline.)










Cost of university attendance in California as a fraction of the median family’s yearly income, 1975-2010.

by Sean Mulcahy






length of time spent jobless in various American recessions

via the Economic Policy Institute

 

Relatedly, here’s Alan Krueger & Andreas Mueller on reservation wages:

This paper presents findings from a survey of 6,025 unemployed workers who were interviewed every week for up to 24 weeks in the fall of 2009 and spring of 2010. Our main findings are: (1) the amount of time devoted to job search declines sharply over the spell of unemployment; (2) the self-reported reservation wage predicts whether a job offer is accepted or rejected; (3) the reservation wage is remarkably stable over the course of unemployment for most workers, with the notable exception of workers who are over age 50 and those who had nontrivial savings at the start of the study; (4) many workers who seek full-time work will accept a part-time job that offers a wage below their reservation wage; and (5) the amount of time devoted to job search and the reservation wage help predict early exits from Unemployment Insurance (UI).

via @tylercowen






While science rightly uses empirical evidence (facts) as the ultimate arbiter of truth, those who experiment and analyse field data usually only credit or discredit ideas / frameworks that some theorist has previously invented.

Tagline. Science: We finally figured out that you could separate fact from superstition by a completely radical method: observation. You can try things, measure them, and see how they work!

Hence the name “theory-killers” for experimental physicists.

 

Where do these theories come from, though? My own experience and my observations of others lead me to believe that an economic theorist’s deep creative centre is informed, flavoured, shaped, and sullied by her own personal experiences, biases, stereotypes, and assumptions about what’s normal. If you talk to people who have deeply integrated into their psyche concepts like “opportunity cost”, “rationality”, “search”, “strategy”, “information”, “evolution”, “optimisation”, and so on, and you disagree with this statement, please tell me.

(For example: a professor of game theory told me that he cannot fathom the motivations of a suicide bomber. He can’t fathom them, so he can’t model them, so we have no theory to predict and curtail their bombing behaviour.

Example 2: Do you think Daniel Ellsberg started running psychological experiments at random until he stumbled upon his famous “Ellsberg Paradox”? No, he had the idea in his head that these two kinds of “uniform distribution” should be different, perhaps getting the idea from Keynes or Frank Knight, and then tested the idea.)

Since there’s a “lone genius” limit on novel* economic theories, a finite upper bound follows on how much fact-checking can improve a theory’s soul. Although one can certainly benefit from pulling on threads, reading monographs, looking at data tables and so on, ultimately I believe deep insights come from the same brain process that generates the fallacy of lack-of-imagination (argumentum ad ignorantiam). Just as people form judgments by the “Does this fit with what can I imagine” test, so too—says I—do economic theories rise from the same murky pit. Personal experiences where we’ve taken in reams of high-dimensional streaming data (like at work) feed this imaginative capacity, such that we can run and assess counterfactual dramas in our heads (sort of like a Monte Carlo). “What if the vendor had said this to my boss? Nah, she wouldn’t have reacted that way. Not like her.” There are some biologists who say that our brains have an especial capability to think through such human dramas. (And in writing that sentence I used the same often fallacious imaginative faculty.) The imaginative faculty is abused by cheap stereotypes—

Ideas can be checked against experiences and personal symbols much more easily than against a tome of facts. Since theoretical creativity proceeds in inspirational flashes and needs to run verificational checks at the speed of imagination, only the checks that can be done very quickly influence the creative process.

* Of course most theories derive from the joining of ideas from the existing literature. But those aren’t “novel” ideas.

It’s my conviction, therefore, that theoretical economists would come up with better theories if they spent more time in “the real world” and less time thinking about isomorphisms.

 

The problem is more acute in economics than in physics, because economic theories are much harder to kill (so many alternative explanations / dismissals one can retreat to) — which shifts some of the burden of correctness to the theorists. If you know that

  1. a compelling idea (like “Those who spend other people’s money will be wasteful”) will be hard to falsify;
  2. it will spread memetically through influential minds;
  3. it’s important to get this right, or else the former USSR and Latin American countries (𝓞 1 billion people) will be screwed over by your idea

then you would be quite reasonable in polishing & perfecting a theory—working to cleanse yourself of biases and myopia, asking yourself if what you’re writing is really quite true, what are the underlying assumptions, and so on.

The problem is also more important for economic theorists to address because those who theorise about the human mind have, erm, direct access to the thing they’re theorising about.

The difficulty of killing an economic theory has been discussed much elsewhere:

and if you read the things I read, you’ve probably had thoughts similar to these:

  • “Really? It took this long for ‘neoindustrial’ ideas like ‘The economics of serfdom differ from the economics of a modern web programmer’ to become acceptable?” And neo-industrial uses a totally neoclassical approach but just in a meta context—rational response to the incentives that come with a social framework, or perhaps game theory rationally optimising evolution rather than individuals.
  • “Really? People think you can just use a probability distribution to model a person’s or a firm’s thought-process?”
  • “Really? We just shrug off counterevidence to the theories by saying they’re only models?”
  • “Really? Real numbers and Lagrangians are underlying all of this?”
  • “Really? It’s so controversial that utility is derived from relative and not absolute wealth?”

and many others. Point being, if you are educated on this stuff, then I’m sure you can see how the Slutsky decomposition is a compelling advance in “research technology”, but can’t carry over as-is to the ultimate subject of interest, which is human behaviour and feelings.

 

Am I just carping? A bit, but I also can propose something like a solution. If those who give out grants for economic research could be convinced that

  • business experience
  • time spent in poor countries
  • experience in a variety of economic roles outside of academia

were important indicators of future relevancy and correctness of research—along with knowledge of a body of literature, knowledge of mathematical/statistical/experimental methods, consulting/political experience, and/or a Ph.D.—then up-and-coming economists would have the incentive to spend time in “the real world” and find out, in a personal way, what the people they theorise about go through.




[I]n the late 1920’s and early 1930’s…. There were lots of deep thoughts [in economics], but a lack of quantitative results. … It is usually not of very great practical or even scientific interest to know whether the [causal] influence [of some factor] is positive or negative, if one does not know anything about the strength.


But much worse is the situation when an [outcome] is determined by many different factors at the same time, some factors working in one direction, others in the opposite directions. One could write long papers about so-called tendencies explaining how this … might work…. But what is the … total net effect of all the factors? This question cannot be answered without measures of … strength….

Trygve Haavelmo

Bank of Sweden pseudo-Dynamite Prize Laureate 1989, for work in econometrics

(Source: nobelprize.org)




“There is more difference within the sexes than between them.”
‒Ivy Compton-Burnett, Mother and Son

“In all of human biology, there is no greater difference than of that between men and women.”
—Some biology notes I found online

These two statements sound like rhetorical opposites, but in fact both are true.

(Says me. I can’t prove this, but I bet that taking everything into consideration, divisions between men & women are greater than those between liberals & conservatives, blacks & non-blacks, tall & short, sick & well, D&D players and people who get laid, etc.)

Let me show how both statements can logically live together harmoniously.

Just like how most men are slower than female Olympians, but at the same time the average man is faster than the average woman.

NB: Not real data.

Measurement

Even when differences are statistically significant enough to draw conclusions (such as: “boys sprint faster than girls”), the magnitude may be really small so that the difference, while indisputable, is also unimportant. (“Statistical significance” is a confusing term in this respect.)

Consider that there are many ways you could measure differences among people. Here are some that come up frequently in the gender wars, grouped suggestively:

  • height, weight, curvature
  • IQ, SAT scores, reading tests
  • speed, throwing distance, fine motor skills
  • communication skills, emotional intelligence
  • went to college, profession is engineer
  • finding things in the refrigerator, ability to focus, ability to multitask

There are many ways to measure each of these “dimensions”. For example, does “speed” mean in the 100m dash, 200m dash, marathon, trail running, bike race, or triathlon? While the answers wouldn’t be independent, they wouldn’t be one-to-one either.

A billion points in a million-dimensional space

Now you are faced with 6.7 billion points in an N-dimensional space, where N is the number of things you could measure. Let’s say like a billion points in a million-dimensional space. (Some dimensions may be collinear.)

On the one hand, there are always lots of pink and blue dots mixing in with each other (e.g. men who sew better than most women)‒and directly from Ivy’s point, the distance among pinks (variation among men) is greater than the distance from the pink centroid to the blue centroid (variation between men and women).

At the same time, though, if you had to choose just one factor by which to color these dots and get maximal classification power, it would have to be gender.

In other words, gender differences may generate a maximally separating hyperplane, but Euclidean distances between differently-gendered points are often small, and Euclidean distances between same-gendered points are often large.
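Here is a toy simulation of that picture. As with the chart above: not real data. Everything in it is an assumption of mine: two groups of 200 points in 50 dimensions, every dimension standard normal, and group means that differ by half a standard deviation in each dimension. The classifier is just "which centroid are you closer to along the line joining them", which is a simple linear separator and not a true maximum-margin hyperplane.

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible

n.per.group <- 200   # assumed group size
n.dims      <- 50    # assumed number of measured dimensions
shift       <- 0.5   # assumed gap between group means, per dimension, in sd units

group.a <- matrix(rnorm(n.per.group * n.dims),               nrow = n.per.group)
group.b <- matrix(rnorm(n.per.group * n.dims, mean = shift), nrow = n.per.group)

# Distance between the two group centroids
centroid.a <- colMeans(group.a)
centroid.b <- colMeans(group.b)
between <- sqrt(sum((centroid.a - centroid.b)^2))

# Average Euclidean distance between pairs of points inside each group
within.a <- mean(dist(group.a))
within.b <- mean(dist(group.b))

# Project each point onto the line joining the centroids, split at the midpoint
direction <- centroid.b - centroid.a
midpoint  <- (centroid.a + centroid.b) / 2
score <- function(points) as.vector(sweep(points, 2, midpoint) %*% direction)
accuracy <- mean(c(score(group.a) < 0, score(group.b) > 0))

print(c(between = between, within.a = within.a, within.b = within.b, accuracy = accuracy))
```

Under these assumptions two random members of the same group sit much farther apart than the two centroids do, and yet the group label still sorts most of the points correctly. Both of the opening quotes hold at once.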




The ratio of US jobseekers to US jobs stands at 4:1.
via John Irons (of argmax.com fame)

 

The seekers-to-jobs ratio rises immediately during a recession, but does not decrease as quickly after the recession ends. (Is this true in general?)

The 2011 seekers-to-jobs ratio, broken down by sector.






The last decade’s debt record for several rich countries.

3-month Bond Yields owed by some of them:      (SOURCE: Bloomberg)

Japan   .10%
UK      .41%
Germany .28%
US      .04%

And here’s one of the yield curves (US’):

 


(Remember, higher yield means the debt costs more to service for the country that’s borrowing.)






[T]he point of introducing L^p spaces in the first place is … to exploit … Banach space. For instance, if one has ‖ƒ − g‖ = 0, one would like to conclude that ƒ = g. But because of the equivalence class in the way, one can only conclude that ƒ is equal to g almost everywhere.

The Lebesgue philosophy is analogous to the “noise-tolerant” philosophy in modern signal processing. If one is receiving a signal (e.g. a television signal) from a noisy source (e.g. a television station in the presence of electrical interference), then any individual component of that signal (e.g. a pixel of the television image) may be corrupted. But as long as the total number of corrupted data points is negligible, one can still get a good enough idea of the image to do things like distinguish foreground from background, compute the area of an object, or the mean intensity, etc.

Terence Tao

If you’re thinking about points in Euclidean space, then yes — if the distance between them is nil, they are in the exact same spot and therefore the same point.

But abstract mathematics opens up more possibilities.

  • Like TV signals. Like 2-D images or 2-D × time video clips.
  • Like crime patterns, dinosaur paw prints, neuronal spike-trains, forged signatures, songs (1-D × time), trajectories, landscapes.
  • Like, any complete normed vector space. (= it’s thick + distance exists + addition exists + everything’s included = it’s a Banach space)
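
To see what "distance nil, but not the same thing" looks like in one of those spaces, here is a small worked example in standard notation. The particular functions on [0,1] are my choice for illustration, not something taken from Tao's post.

```latex
\[
  \|f-g\|_{L^p} \;=\; \left( \int_0^1 |f(x)-g(x)|^p \, dx \right)^{1/p},
  \qquad 1 \le p < \infty .
\]
Take $f = \mathbf{1}_{\mathbb{Q}\cap[0,1]}$ (equal to $1$ on the rationals, $0$ elsewhere)
and $g = 0$. The rationals have Lebesgue measure zero, so
\[
  \|f-g\|_{L^p}^{\,p} \;=\; \int_0^1 \mathbf{1}_{\mathbb{Q}\cap[0,1]}(x)\, dx \;=\; 0 ,
\]
yet $f(x) \neq g(x)$ at every rational $x$. Zero distance therefore gives only
$f = g$ almost everywhere, not $f = g$ pointwise.
```

In the TV analogy, the rationals are the corrupted pixels: infinitely many of them, but too few to change any integral.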

(Source: terrytao.wordpress.com)




data from the US Drug Enforcement Administration’s System To Retrieve Information from Drug Evidence (STRIDE)

A few points about these pictures which I’ll be elaborating on in future posts:

  • sub i, sub j: There is significant variation from city to city and presumably dealer to dealer or customer to customer, since they plot interquartile range.
  • 3-D data: Since both purity and quantity affect the price, we’re really talking about a “price surface” — just like a volatility surface or the yield curve on Treasurys. And in fact there are even more dimensions to the data since it could be cut differently, and … well, I won’t say what makes for good coke.
  • data collection: Do you really believe these numbers? Some undercover cop probably solicited drugs (I didn’t read the methodology section, so I’m just guessing). Does that seem like an error-free data collection process? But the same goes for macroeconomic data, financial data from companies, and so on. It comes from somewhere; it’s not necessarily “the truth”.










When people pontificate about national politics, I find the dialogue too generalistic.

These discussions ignore most of the interesting variation and lose touch with real places. And, certain facts that are obvious if you’re familiar with the more specific numbers seem “miraculous” when you just hear one nation-level statistic. (Tax statistics are one such.)

Consider the US unemployment rate, for example. Not only does that figure make it sound like the same 9.5% are unemployed — not true, it’s just an aggregate of all hirings & firings and business openings & business closings — but the unemployment rate in Dane, WI, doesn’t really affect me, because I live in Monroe, IN. If I see some really, really, really compelling place — like Travis, TX — I might uproot my entire life and thenceforth be affected by the data in Travis, TX. And a nearby, culturally good place like Louisville is relevant. I moved to Louisville for a while for a job. But mostly, I need to focus on improving the economy in Monroe, IN.

I remember very well, when I was running my first business, reading grim economic news about the rest of the country. Mall-dwelling retard businesses, national franchises leveraged on the assumption that all of their new franchisees will face good economic conditions … they were affected by the national statistics, but not me. The newspapers kept shouting about how bad things were and I didn’t see it at all.

 

I think if people were primed by reading a table like this before engaging in debates, a lot fewer overly-generalistic ideas would be floated. Looking at regional variation puts me in a frame of mind that’s more specific, more sub_i sub_j, in touch with data and out of touch with theory.

N America is too big for anyone’s imagination. Europe is too big for anyone’s imagination. Africa is too big for anyone’s imagination. China is too big for anyone’s imagination. India is too big for anyone’s imagination. Theory makes the world seem small, which is necessary to be able to comprehend huge topics. But Theory can make you overconfident. Data humble you.

The question

  • How will policy X create green jobs in Monroe County? in Travis County? in Lancaster County?

gets my gears running very differently than the question

  • “How will policy X create green jobs?”

Importantly, the first question is more bullsh~t-proof, even though logically a “Create green jobs” type of claim should be evaluated as the sum total of all green jobs created in every county.

The number next to each county is its average weekly wage, in dollars.

Table 1. Covered establishments, employment, and wages in the 323 largest counties, first quarter 2011

County	                        Average weekly wage ($)
United States............	935

San Juan, PR.............	598
Peoria, IL...............	944
Santa Clara, CA..........	1863
Macomb, MI...............	941
Clayton, GA..............	844
Wayne, MI................	1021
Brazoria, TX.............	922
Saginaw, MI..............	760
Stark, OH................	703
Butler, PA...............	799
New York, NY.............	2634
Hartford, CT.............	1260
Fulton, GA...............	1370
Washington, PA...........	867
Snohomish, WA............	968
Genesee, MI..............	742
Fort Bend, TX............	979
Jefferson, TX............	920
Forsyth, NC..............	891
Montgomery, TX...........	886
Hennepin, MN.............	1197
Harris, TX...............	1258
Weld, CO.................	776
Winnebago, IL............	769
Oakland, MI..............	1019
Catawba, NC..............	692
Cuyahoga, OH.............	953
Middlesex, MA............	1370
Mecklenburg, NC..........	1231
Marin, CA................	1103
San Diego, CA............	1003
Worcester, MA............	908
Anoka, MN................	829
Milwaukee, WI............	929
Douglas, CO..............	1069
San Francisco, CA........	1723
Lorain, OH...............	750
Sedgwick, KS.............	816
Caddo, LA................	736
Washington, OR...........	1120
Erie, PA.................	695
Cass, ND.................	765
Whatcom, WA..............	745
Los Angeles, CA..........	1046
Hamilton, IN.............	924
Benton, AR...............	1110
Howard, MD...............	1141
Somerset, NJ.............	1867
Bexar, TX................	838
Contra Costa, CA.........	1210
Nueces, TX...............	748
New Castle, DE...........	1194
Bristol, MA..............	791
Essex, MA................	955
Henrico, VA..............	1027
Ramsey, MN...............	1093
Dane, WI.................	878
Scott, IA................	725
Ottawa, MI...............	714
Westmoreland, PA.........	716
De Kalb, GA..............	992
Fayette, KY..............	811
Ingham, MI...............	879
Travis, TX...............	1002
Tuscaloosa, AL...........	778
Muscogee, GA.............	749
Frederick, MD............	904
Hillsborough, NH.........	975
Lucas, OH................	793
Charleston, SC...........	774
Cook, IL.................	1145
Collin, TX...............	1075
Virginia Beach City, VA..	717
Fairfield, CT............	1888
Vanderburgh, IN..........	729
Rockingham, NH...........	857
Camden, NJ...............	903
Lake, IN.................	791
St. Louis, MN............	722
King, WA.................	1185
Pulaski, AR..............	819
Oklahoma, OK.............	837
Elkhart, IN..............	698
Larimer, CO..............	795
Mercer, NJ...............	1283
Multnomah, OR............	918
Allegheny, PA............	997
Greenville, SC...........	770
Dallas, TX...............	1156
Maricopa, AZ.............	889
Sacramento, CA...........	1025
Santa Barbara, CA........	869
Tulsa, OK................	825
Kanawha, WV..............	797
Denver, CO...............	1212
Will, IL.................	793
Plymouth, MA.............	815
Suffolk, MA..............	1625
Kalamazoo, MI............	816
Jefferson, AL............	919
Ada, ID..................	773
Polk, IA.................	940
Minnehaha, SD............	748
Shelby, TN...............	915
Richmond City, VA........	1071
Calcasieu, LA............	768
Cumberland, ME...........	835
Buncombe, NC.............	676
Guilford, NC.............	802
Webb, TX.................	590
Benton, WA...............	959
Mobile, AL...............	741
New Haven, CT............	956
New London, CT...........	960
Lafayette, LA............	847
Lancaster, PA............	734
Washington, AR...........	726
Greene, MO...............	661
Yellowstone, MT..........	721
Middlesex, NJ............	1191
Erie, NY.................	794
Mahoning, OH.............	632
Dauphin, PA..............	889
Northampton, PA..........	791
Spokane, WA..............	751
Placer, CA...............	876
Hillsborough, FL.........	880
McHenry, IL..............	727
Harford, MD..............	844
Barnstable, MA...........	759
Norfolk, MA..............	1066
Essex, NJ................	1229
Broome, NY...............	703
Philadelphia, PA.........	1079
Madison, AL..............	978
Ventura, CA..............	964
Orange, FL...............	805
Palm Beach, FL...........	886
Wyandotte, KS............	826
Franklin, OH.............	920
Williamson, TN...........	1054
Galveston, TX............	827
Fairfax, VA..............	1479
Lee, FL..................	711
Shawnee, KS..............	751
Onondaga, NY.............	831
Newport News City, VA....	826
Clark, WA................	800
Pima, AZ.................	768
Kern, CA.................	790
Escambia, FL.............	690
Queens, NY...............	844
Suffolk, NY..............	972
Cumberland, NC...........	695
New Hanover, NC..........	741
Chesapeake City, VA......	724
Brown, WI................	803
Montgomery, AL...........	764
Adams, CO................	806
Collier, FL..............	767
Oneida, NY...............	708
Hamilton, OH.............	992
Luzerne, PA..............	684
Bell, TX.................	736
Chesterfield, VA.........	830
Alameda, CA..............	1183
Cobb, GA.................	962
Allen, IN................	747
Berks, PA................	780
Lexington, SC............	650
Boulder, CO..............	1050
Polk, FL.................	668
Chatham, GA..............	752
Richmond, GA.............	743
Linn, IA.................	847
Montgomery, MD...........	1311
Hinds, MS................	778
Denton, TX...............	780
Outagamie, WI............	747
Waukesha, WI.............	902
Lehigh, PA...............	879
Smith, TX................	739
Salt Lake, UT............	856
Jefferson, CO............	929
Baltimore City, MD.......	1081
Cumberland, PA...........	815
Delaware, PA.............	1003
Utah, UT.................	681
Manatee, FL..............	668
Marion, IN...............	987
Jefferson, LA............	831
Dakota, MN...............	895
St. Louis, MO............	973
Lancaster, NE............	711
Richmond, NY.............	758
Lake, OH.................	774
Norfolk City, VA.........	861
Alachua, FL..............	730
Burlington, NJ...........	957
York, PA.................	789
Fresno, CA...............	709
Sonoma, CA...............	846
Miami-Dade, FL...........	874
Gwinnett, GA.............	879
Du Page, IL..............	1076
Sangamon, IL.............	907
Jefferson, KY............	873
Kent, MI.................	792
Olmsted, MN..............	968
Washoe, NV...............	789
Monroe, NY...............	847
Clackamas, OR............	798
Lane, OR.................	672
Orange, CA...............	1035
San Bernardino, CA.......	754
Nassau, NY...............	1015
Montgomery, OH...........	782
El Paso, TX..............	626
Tarrant, TX..............	900
Riverside, CA............	748
San Joaquin, CA..........	752
Broward, FL..............	834
Ocean, NJ................	746
Bronx, NY................	818
Davidson, TN.............	927
Hidalgo, TX..............	556
Duval, FL................	891
Seminole, FL.............	735
Honolulu, HI.............	821
St. Joseph, IN...........	723
Boone, MO................	692
Douglas, NE..............	853
Passaic, NJ..............	921
Bucks, PA................	855
Richland, SC.............	794
Chittenden, VT...........	878
Orleans, LA..............	983
Knox, TN.................	750
Brazos, TX...............	659
Cameron, TX..............	546
McLennan, TX.............	727
Pierce, WA...............	821
El Paso, CO..............	812
Champaign, IL............	750
Albany, NY...............	937
Chester, PA..............	1164
Lackawanna, PA...........	665
Horry, SC................	534
Tulare, CA...............	622
Lake, FL.................	586
Marion, FL...............	614
Pasco, FL................	596
Pinellas, FL.............	765
Volusia, FL..............	629
Kane, IL.................	777
East Baton Rouge, LA.....	831
St. Louis City, MO.......	1037
Atlantic, NJ.............	772
Bergen, NJ...............	1152
Lubbock, TX..............	653
Solano, CA...............	921
Arapahoe, CO.............	1130
Monmouth, NJ.............	945
Jackson, OR..............	644
Anchorage Borough, AK....	958
Bernalillo, NM...........	781
Rockland, NY.............	991
Spartanburg, SC..........	761
Stanislaus, CA...........	748
Bibb, GA.................	699
Johnson, KS..............	955
Morris, NJ...............	1462
Washington, DC...........	1540
Sarasota, FL.............	722
Clay, MO.................	850
Weber, UT................	642
Baltimore, MD............	920
Providence, RI...........	895
Davis, UT................	704
Brevard, FL..............	801
Stearns, MN..............	700
Orange, NY...............	755
Summit, OH...............	841
Yakima, WA...............	606
Winnebago, WI............	831
San Luis Obispo, CA......	742
Santa Cruz, CA...........	814
McLean, IL...............	904
Madison, IL..............	738
Prince Georges, MD.......	933
Montgomery, PA...........	1198
Rutherford, TN...........	771
Loudoun, VA..............	1093
St. Clair, IL............	709
Union, NJ................	1199
Wake, NC.................	917
Marion, OR...............	699
Clark, NV................	790
Dutchess, NY.............	917
Kitsap, WA...............	798
Harrison, MS.............	668
Monterey, CA.............	808
San Mateo, CA............	1485
Jackson, MO..............	894
St. Charles, MO..........	744
Westchester, NY..........	1332
Prince William, VA.......	808
Washtenaw, MI............	925
Gloucester, NJ...........	766
Kings, NY................	725
Leon, FL.................	722
Hampden, MA..............	812
Thurston, WA.............	800
Arlington, VA............	1549
Butler, OH...............	781
Hamilton, TN.............	785
Durham, NC...............	1276
Hudson, NJ...............	1509
Williamson, TX...........	953
Yolo, CA.................	892
Lake, IL.................	1230
Anne Arundel, MD.........	958
Alexandria City, VA......	1226 

Data notes:

  • There’s a lot of variation in the number of counties per American state. For example, Indiana (36k sq mi) has 92 counties whilst Massachusetts (10k sq mi) has 14.
  • Also, this is only private employers, which skews some of the Maryland and Virginia numbers.
  • Also, this is a look at employed people, and it doesn’t count benefits.

Some raw-data observations:

  • Average income in New York County is $2,600/week but only $800/week in the Bronx.
  • San Francisco and Arlington, VA are about $1,000/week less than New York County.
  • Incomes in Indianapolis (Marion County) are a joke on a national scale. Even if you include people in Carmel (Hamilton County) it’s still less than $1,000/week. I thought all of those Lilly people made a tidy bundle; I guess they’re too few to bring up the average.
  • I should ddply this data.
  • There seem to be a lot of $600s, $700s and $800s. That basically checks out with a median household income of $51k, although households can comprise two individual incomes.
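
For what "ddply this data" might look like, here is a minimal split-apply-combine sketch. `ddply` is an R function (from the plyr package); this is a Python standard-library analogue, not code I actually ran on the full table. The sample rows are copied from the table above. The names `parse`, `by_state` and `annualize` are made up for illustration, and the per-state mean is a plain average of county averages, not weighted by employment.

```python
import re
from collections import defaultdict
from statistics import mean

# A handful of rows copied verbatim from the table (not all 323).
SAMPLE = """\
Dane, WI.................\t878
Travis, TX...............\t1002
Harris, TX...............\t1258
Hidalgo, TX..............\t556
Marion, IN...............\t987
Hamilton, IN.............\t924
New York, NY.............\t2634
Bronx, NY................\t818
"""

# "County, ST" + dot leaders + whitespace + weekly wage.
ROW = re.compile(r"^(?P<county>.+?), (?P<state>[A-Z]{2})\.*\s+(?P<wage>\d+)\s*$")


def parse(text):
    """Return (county, state, wage) tuples; lines that don't match are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((m["county"], m["state"], int(m["wage"])))
    return rows


def by_state(rows):
    """Split by state, then summarise each group (unweighted across counties)."""
    groups = defaultdict(list)
    for _county, state, wage in rows:
        groups[state].append(wage)
    return {
        state: {"n": len(w), "mean": mean(w), "min": min(w), "max": max(w)}
        for state, w in sorted(groups.items())
    }


def annualize(weekly_wage, weeks=52):
    """Weekly wage times 52; ignores benefits, as the data notes say."""
    return weekly_wage * weeks


for state, stats in by_state(parse(SAMPLE)).items():
    print(state, stats)
```

The same helper covers the household-income sanity check: `annualize(700)` is $36,400 and `annualize(800)` is $41,600, so a $51k household sits between one and two such earners.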