Posts tagged with statistics

It takes ~20 observations to verify your first significant digit of the mean with confidence.

Do you know how many observations it takes to verify your first sig-fig of the variance? More like 1000. And that’s just to get one digit of accuracy! Higher moments (skew, kurtosis) are even worse.

That’s why I often laugh out loud when I read in the newspaper claims that rely on a certain value of the variance. Even in serious, published papers!—I often see tables with estimates of standard deviation that go out to three decimal places, just because the software spat the numbers out that way. It gives a false sense of accuracy. It’s ridiculous.
Karen Kafadar
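
A rough way to see the gap is by simulation. The sketch below is not from the quote: it assumes normally distributed data with mean 10 and standard deviation 3 (both arbitrary choices) and measures how much the sample mean and the sample variance wobble, relative to their true values, at a given sample size.

```python
import numpy as np

def relative_spread(n, trials=2000, mu=10.0, sigma=3.0, seed=0):
    """Relative standard deviation of the sample mean and the sample variance
    across repeated samples of size n (normal data; mu and sigma are assumed)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=(trials, n))
    means = x.mean(axis=1)
    variances = x.var(axis=1, ddof=1)  # unbiased sample variance
    return means.std() / mu, variances.std() / sigma**2

for n in (20, 1000):
    rel_mean, rel_var = relative_spread(n)
    print(f"n={n:5d}  mean: +/-{rel_mean:.1%}   variance: +/-{rel_var:.1%}")
```

Under normality the relative standard error of the sample variance is about sqrt(2/(n−1)) whatever the scale, while for the mean it depends on the ratio σ/μ, so the exact observation counts depend on the data at hand.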




I feel vindicated by the Netflix Engineering team’s recent blog post explaining what they did with the results of the Netflix Prize. What they wrote confirms what I’ve been saying about recommendations, as well as my experience designing recommendation engines for clients, in several ways:

  1. Fancy ML techniques don’t matter so much. The winning BellKor/Pragmatic Chaos teams implemented ensemble methods with something like 112 techniques smushed together. You know how many of those the Netflix team implemented? Exactly two: RBM’s and SVD.

    If you’re a would-be internet entrepreneur and your idea relies on some ML but you can’t afford a quant to do the stuff for you, this is good news. Forget learning every nook and cranny of research like Pseudo-Markovian Multibagged Quantile Dark Latent Forests! You can watch an hour-long video on OCW by Gilbert Strang which explains SVD and two hour-long Google Tech Talks by Geoff Hinton on RBM’s. RBM’s are basically a superior subset of neural networks, with a theoretical basis for why they’re superior. SVD is a dimension-reduction technique from linear algebra. (There are many Science / Nature papers on dimension reduction in biology; if you don’t have a licence there are paper-request fora on Reddit.)

    Not that I don’t love reading about awesome techniques, or that something other than SVD isn’t sometimes appropriate. (In fact using the right technique on the right portion of the problem is valuable.) What Netflix people are telling us is that, in terms of a Kaggleistic one-shot on the monolithic data set, the diminishing marginal improvements to accuracy from a mega-ensemble algo don’t count as useful knowledge.


  2. Domain knowledge trumps statistical sophistication. This has always been the case in the recommendation engines I’ve done for clients. We spend most of our time trying to understand the space of your customers’ preferences — the cells, the topology, the metric, common-sense bounds, and so on. You can OO program these characteristics. And (see bottom) doing so seems to improve the ML result a lot.

    Another reason you’re probably safe ignoring the bleeding edge of ML research is that most papers develop general techniques, test them on famous data sets, and don’t make use of domain-specific knowledge. You want a specific technique that’s going to work with your customers, not a no-free-lunch-but-optimal-according-to-X academic algorithm. Some Googlers did a sentiment-analysis paper on exactly this topic: all of the text analysis papers they had looked at chose not to optimise on specific characteristics (like keywords or text patterns) known to anyone familiar with restaurant-review data. They were able to achieve a superior solution to that particular problem without fancy new maths, only using common sense and exploration specific to their chosen domain (restaurant reviews).



  3. What you measure matters more than what you squeeze out of the data. The reason I don’t like* Kaggle is that it’s all about squeezing more juice out of existing data. What Netflix has come to understand is that it’s more important to phrase the question differently. The one-to-five-star paradigm is not going to accurately assess their customers’ attitudes toward movies. The similarity space is more like Dr Hinton’s reference to a ten-dimensional library where neighbourhood relationships don’t just go along a Dewey Decimal line but also style, mood, season, director, actors, cinematography, and yes the “People like you” metric (“collaborative filtering”, a spangled bit of jargon).

    For them the preferences evolve fairly quickly over time, which has to make the problem hard. If your users’ preferences evolve over time too: good luck.

    John Wilder Tukey: “To statisticians, hubris should mean the kind of pride that fosters an inflated idea of one’s powers and thereby keeps one from being more than marginally helpful to others. … The feeling of “Give me (or more likely even, give my assistant) the data, and I will tell you what the real answer is!” is one we must all fight against again and again, and yet again.” via John D Cook 

Relatedly, a friend of mine who’s doing a Ph.D. in complexity (modularity in Bayesian networks) has been reading the Kaggle fora from time to time. His observation of the Kaggle winners is that they usually win with gross assumptions about either the generating process or the underlying domain. Basically they limit the ML search using common sense and data exploration; that gives them a significant boost in performance (a lower 1−AUC).

* I admire @antgoldbloom for following through on his idea and I do think they have a positive impact on the world. Which is much better than the typical “Someone should make X, that would be a great business” or even worse but still typical: “I’ve been saying they should have that!” Still, I do hold to my one point of critique: there’s no back-and-forth in Kaggle’s optimisation.
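
For a concrete sense of what SVD does as a dimension-reduction technique (point 1 above), here is a minimal sketch with NumPy. The ratings matrix is made up for illustration, and this is plain truncated SVD on a fully observed matrix, not whatever Netflix actually deployed, which would have to cope with missing ratings.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies (1 to 5 stars).
ratings = np.array([
    [5, 4, 5, 1, 1],
    [4, 5, 4, 2, 1],
    [1, 1, 2, 5, 4],
    [2, 1, 1, 4, 5],
], dtype=float)

def truncated_svd(matrix, k):
    """Best rank-k approximation of `matrix` in the least-squares sense,
    returned together with all singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :], s

approx, singular_values = truncated_svd(ratings, k=2)
print("singular values:", np.round(singular_values, 2))
print("rank-2 reconstruction:")
print(np.round(approx, 1))
```

Two "taste" dimensions are enough to reproduce most of this toy matrix; the singular values that get dropped measure how much is lost.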




Lawrence Krauss, author of A Universe from Nothing, lecturing on cosmology.

  • Don’t really agree with or like his monolithic straw-man representation of “religion” versus “science” at minute 6. “Religion pretends to know all the answers.”

    Sub-i, sub-j, Larry. There are many religions and many sciences.
  • Minute 14. Edwin Hubble’s original data! straight-line plot through a bunch of dispersed points. “That’s why we know he was a great scientist” — nobody laughed in the tape, but I did — “he knew that he should draw a straight line through a cloud of points”. I also love it when people take the time to go through an old paper, pull things out, and present them anew.
  • I have never understood the business of standard candles. To me it seems like you have two degrees of freedom (distance and brightness), only one of which can be knocked out by the measurement of apparent brightness.

    So say we figure out a “standard candle” — a star with a particular colour signature that tells us “The star is at X phase of its life, is made up of Z, and such stars always shine at a constant brightness of 1 for Q million years.”

    But still: how do we know that our theory is right? How do we know, know, know that it’s really brightness of 1? It’s not like we can triangulate. And it’s certainly not like we’ve been there and seen it first-hand.
  • I had the same problem in a discussion with a geologist a few months ago. I sometimes get the sense that working scientists are so immersed in the practical fact that, yes, for all intents and purposes we know X to be true, that they’re not willing to step back to an abstract, philosophical level and say: “Well, if you really keep pulling on the threads, there are assumptions at the bottom of everything, so yes, we really don’t absolutely know X to be the case. However, Philosophical Prig, we don’t really know we’re not living in The Matrix either! So hush up and get back to doing something relevant.” But that’s the kind of answer I really want to hear: no, we don’t know know know, but for all practical purposes, yes we know.
  • Minute 15. How old is the universe? So Hubble got the answer wrong in 1929, and it was obviously wrong. “Scientists don’t know what they’re doing”

    But I had the same reaction to people talking about dark matter in the 90’s. “What is this stuff we call dark matter? Or dark energy?” As I understood it at the time, “dark matter” just represented a 90% fudge factor in astronomical measurements. It could be that gravity or quarks or anything else about the laws of physics is simply different in other parts of the universe. And how would we rule out that hypothesis? We just rule it out by assuming that the laws of Nature are the same everywhere, because that’s what we’ve assumed for the last few hundred years and it’s always worked out. Straight-line extrapolation to “That assumption must be true now and everywhere” despite that we’re now talking about multiple galaxies so unimaginably far away.
  • Minute 18:30 “This is a Hubble plot, much better than Hubble’s plot. It was made after the discovery that on a log-log plot, everything is a straight line.” Again, no laughs, but I thought that was hilarious.
  • Calculations that estimate the total energy in all vacuums add up to 10^28 times the observed mass of the universe. Whoops.
  • Dark matter here on Earth? Let’s go down into the mines and measure it. (By the way, where would the physicists be if those evil resource-extraction companies in Lead, South Dakota hadn’t negotiated with the legal entities that be and drilled into the Earth’s crust? Way to play it as it lies, Sanford Lab. #scruples)
  • Flat, closed, or open universe? (also why are these the only three options?) Well, we only observe 30% of the mass that would be required to make the universe flat.
  • A gigantic, gigantic, um, really gigantic triangle — to measure the curvature of the universe.
  • That’s what those microwave-background radiation detecting balloons in Antarctica have been doing.
  • There’s always something there, even when there’s nothing. (see this video of the quantum fields flickering about in empty space)
  • 90% of the mass of a proton is due to the vacuum. (not delta spikes, more like 1/x or exp(−x) integrals.) Therefore your mass is 90% due to quantum fluctuations around the zero point energy.
  • The universe also has a net total energy of 0. Hence the possibility of “a universe from nothing” (our universe needn’t have a Creator since there is enough mass/energy in the physical vacuum that those virtual fluctuations could have acted as a Prime Mover).
  • 70% + 30% = 100%
  • Making our place in the Universe even less special. “Regular” matter—the stuff we observe—is only a 1% pollution in the uniform dark-energy / dark-matter background of the universe.
  • Deep-future scientists (like in a few billion years) won’t be able to observe other galaxies. Measuring the universe, they will observe (correctly) that their galaxy is the only one around, and that there is nothing but empty, eternal space around them.
  • So they will be “Lonely and ignorant, but dominant. Of course those of us who live in the United States are already used to that.”







It is never in good taste to express the sum of two quantities as

  • 1 + 1 = 2.   (1)

[Everyone] is aware that

  • 1 = ln e

and further that

  • 1 = sin²q + cos²q.

In addition, it is obvious to the casual reader that

  • 2 = Σ_{n=0}^{∞} 1/2ⁿ.

Therefore equation (1) can be rewritten more scientifically as:

  • ln e + (sin²q + cos²q) = Σ_{n=0}^{∞} 1/2ⁿ.

by John Siegfried in the Journal of Political Economy. Hat tip: @unlearningecon

(Source: twitter.com)




Mostly in finance we assume that we have the equivalent of a standard dice. That is, while we assume we don’t know what number will come up next, we think that we know the distribution of numbers perfectly.


In fact the real situation is much more akin to throwing a dice where we have imperfect knowledge of what numbers are on the faces. They might be 1 to 6; but they also might be 1 to 5 with the 1 repeated; or 2 to 7; or something else entirely.


Worse, the numbers are changed by the malevolent hand of chance on a regular basis.


Not so often that we know nothing about the distribution, but often enough that we cannot be sure that the current market will be like the past.

David Murphy

(And I would add: sometimes the numbers on the die are being changed not by the malevolent hand of chance, but by the malevolent hand of a market participant who is smarter than you and can siphon your profits into their bank account.)

(Source: blog.rivast.com)




As nice as it is to be able to assume normality, … there are problems. The most obvious problem is that we could be wrong.


One … very nice thing … is that, in many situations, … [being wrong] won’t send us immediately to jail without passing “Go.” Under a … broad set of conditions … our assumption [could be wrong, yet we] get away with it. By this I mean that our answer may still be correct even if our assumption is false. This is what we mean when we speak of a [statistic] … being robust.



However, this still leaves at least two problems. In the first place, it is not hard to create reasonable data that violate a normality (or homogeneity of variance) assumption and have “true” answers that are quite different from the answer we would get by making a normality assumption. In other words, we can’t always get away with violating assumptions. Second, there are many situations where even with normality, we don’t know enough about the statistic we are using to draw the appropriate inferences.



One way to look at bootstrap procedures is as procedures for handling data when we are not willing to make assumptions about the parameters of the populations from which we sampled. The most that we are willing to assume (and it is an absolutely critical assumption) is that the data we have are a reasonable representation of the population from which they came. We then resample from the pool of data that we have, and draw inferences about the corresponding population and its parameters.

The second way to look at bootstrap procedures is to think of them as what we use when we don’t know enough.

David Howell

(Source: uvm.edu)
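
The resampling step Howell describes can be sketched in a few lines. This is a generic percentile bootstrap for the median, not code from Howell's page; the sample values, the number of resamples and the 95% level are all illustrative assumptions.

```python
import numpy as np

def bootstrap_ci(data, statistic, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `statistic`.

    Resamples `data` with replacement, treating the observed sample as a
    stand-in for the population (the critical assumption in the quote).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Each row of idx picks len(data) observations, with replacement.
    idx = rng.integers(0, len(data), size=(n_resamples, len(data)))
    stats = np.array([statistic(data[row]) for row in idx])
    tail = (1 - level) / 2
    return np.quantile(stats, [tail, 1 - tail])

# Made-up, skewed sample where a normality assumption would be shaky.
sample = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 3.0, 3.8, 5.5, 12.0]
low, high = bootstrap_ci(sample, np.median)
print(f"95% bootstrap interval for the median: [{low:.2f}, {high:.2f}]")
```

Nothing here assumes a distributional form; everything rests on the sample being a reasonable picture of the population, so a tiny or unrepresentative sample gives a misleading interval.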




[I]n the late 1920’s and early 1930’s…. There were lots of deep thoughts [in economics], but a lack of quantitative results. … It is usually not of very great practical or even scientific interest to know whether the [causal] influence [of some factor] is positive or negative, if one does not know anything about the strength.


But much worse is the situation when an [outcome] is determined by many different factors at the same time, some factors working in one direction, others in the opposite directions. One could write long papers about so-called tendencies explaining how this … might work…. But what is the … total net effect of all the factors? This question cannot be answered without measures of … strength….

Trygve Haavelmo

Bank of Sweden pseudo-Dynamite Prize Laureate 1989, for work in econometrics

(Source: nobelprize.org)




Thank you, steel manufacturing companies, and thank you, chemical processing companies, for giving us the time to read. —Hans Rosling

Totally good point about how the mechanisation of the rich world has allowed us to have so many professors, doctors, photographers, lawyers, and social media managers.

 

But I wonder: why is laundry so important?

There has to be a good reason; no one working with their hands for 70+ hours a week would choose to do an extra 10 hours of labour a week if they could avoid it. But I know from experience that, in my world, if you don’t do laundry for months at a time, nothing bad happens to you.

What did I do instead of laundry? I’ve taken a few options, some of which would have been available to poor humans now or in the past:

  1. wash clothing with the excess soapy water that falls off me in the shower (not available to them)
  2. turn clothing inside out and leave it outside (requires a lot of socks but before the 19th century no one was wearing socks anyway)

The second you would think poor people could do pretty easily. I used my porch, which got sun and wind that, over time, blew away most of the smells.

So what’s the reason they couldn’t do that? I have a few theories.

  • They laboured with their bodies, getting much sweatier than I do at my computer.
  • Bugs and germs were more prevalent in their environment and got in their clothing if it weren’t soaped — or at least exposed to ammonia rising off the castle pissing grounds.
  • They got dirtier, muddier, muckier. But why would you need to deal with that?
  • Having clean clothes raised your appeal to the opposite sex, and social status went along with that as it goes along with attractiveness today. Clean isn’t necessary; it’s just sexy (on average).

Anyway, I wonder if it isn’t the other changes to the modern OECD environment (reduction in bugs and reduction in manual labour) that made for the progress. Nowadays I just use the washer when I’ve exercised or played in the mud.

If the wash was always just a way of keeping up with the Joneses, however, then we can’t congratulate the washing machine for saving us necessary labour — it just helps us live out our autocompetitive rank obsessions in other ways now the elbow’s been surpassed on that dimension.




“There is more difference within the sexes than between them.”
‒Ivy Compton-Burnett, Mother and Son

“In all of human biology, there is no greater difference than of that between men and women.”
—Some biology notes I found online

These two statements sound like rhetorical opposites, but in fact both are true.

(Says me. I can’t prove this, but I bet that taking everything into consideration, divisions between men & women are greater than those between liberals & conservatives, blacks & non-blacks, tall & short, sick & well, D&D players and people who get laid, etc.)

Let me show how both statements can logically live together harmoniously.

Just like how most men are slower than female Olympians, but at the same time the average man is faster than the average woman.

NB: Not real data.

Measurement

Even when differences are statistically significant enough to draw conclusions (such as: “boys sprint faster than girls”), the magnitude may be really small so that the difference, while indisputable, is also unimportant. (“Statistical significance” is a confusing term in this respect.)
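
To make the gap between "significant" and "large" concrete, here is a minimal sketch using only the standard library. The sprint times are invented, and the test is a plain large-sample z-test chosen for simplicity.

```python
import math
import random

def two_sample_z(a, b):
    """Two-sided z-test for a difference in means (large-sample approximation)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return mean_a - mean_b, z, p_value

random.seed(0)
n = 200_000
# Invented sprint times in seconds: the groups differ by 0.05 s on average,
# which is tiny next to a within-group spread of 1.5 s.
group_a = [random.gauss(15.00, 1.5) for _ in range(n)]
group_b = [random.gauss(15.05, 1.5) for _ in range(n)]

diff, z, p = two_sample_z(group_a, group_b)
print(f"difference = {diff:.3f} s, z = {z:.1f}, p = {p:.2g}")
```

With enough observations the p-value becomes minuscule even though the two groups overlap almost completely.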

Consider that there are many ways you could measure differences among people. Here are some that come up frequently in the gender wars, grouped suggestively:

  • height, weight, curvature
  • IQ, SAT scores, reading tests
  • speed, throwing distance, fine motor skills
  • communication skills, emotional intelligence
  • went to college, profession is engineer
  • finding things in the refrigerator, ability to focus, ability to multitask

There are many ways to measure each of these “dimensions”. For example, does “speed” mean in the 100m dash, 200m dash, marathon, trail running, bike race, or triathlon? While the answers wouldn’t be independent, they wouldn’t be one-to-one either.

A billion points in a million-dimensional space

Now you are faced with 6.7 billion points in an N-dimensional space, where N is the number of things you could measure. Let’s say like a billion points in a million-dimensional space. (Some dimensions may be collinear.)

On the one hand, there are always lots of pink and blue dots mixing in with each other (e.g. men who sew better than most women)‒and directly from Ivy’s point, the distance among pinks (variation among men) is greater than the distance from the pink centroid to the blue centroid (variation between men and women).

At the same time, though, if you had to choose just one factor by which to color these dots and get maximal classification power, it would have to be gender.

In other words, gender differences may generate a maximally separating hyperplane, but Euclidean distances between differently-gendered points are often small, and Euclidean distances between same-gendered points are often large.
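
A toy simulation of that geometry, with synthetic data only. The dimensions, sample sizes and the 0.15-standard-deviation shift per axis are assumptions, and the classifier is a simple nearest-centroid hyperplane used as a stand-in, not a true maximum-margin one.

```python
import numpy as np

rng = np.random.default_rng(0)
dims, n = 200, 500   # illustrative sizes, far smaller than the post's
shift = 0.15         # each group's mean sits 0.15 sd from zero on every axis

def sample_groups(n):
    pink = rng.normal(+shift, 1.0, size=(n, dims))
    blue = rng.normal(-shift, 1.0, size=(n, dims))
    return pink, blue

train_pink, train_blue = sample_groups(n)
test_pink, test_blue = sample_groups(n)

# Distance between the two group centroids.
normal = train_pink.mean(axis=0) - train_blue.mean(axis=0)
centroid_gap = np.linalg.norm(normal)

# Typical distance between two members of the same group.
within = np.linalg.norm(train_pink[: n // 2] - train_pink[n // 2 :], axis=1).mean()

# Hyperplane through the midpoint of the centroids, normal to their difference.
midpoint = (train_pink.mean(axis=0) + train_blue.mean(axis=0)) / 2

def is_pink(points):
    return (points - midpoint) @ normal > 0

accuracy = (is_pink(test_pink).mean() + (~is_pink(test_blue)).mean()) / 2
print(f"centroid gap {centroid_gap:.1f}, within-group distance {within:.1f}, "
      f"held-out accuracy {accuracy:.1%}")
```

Many small per-dimension differences add up along one direction, which is enough for a hyperplane to split the groups, while the distance between any two individuals is dominated by everything else that varies.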




Data from the US Drug Enforcement Administration’s System To Retrieve Information from Drug Evidence (STRIDE)

A few points about these pictures which I’ll be elaborating on in future posts:

  • sub i, sub j: There is significant variation from city to city and presumably dealer to dealer or customer to customer, since they plot interquartile range.
  • 3-D data: Since both purity and quantity affect the price, we’re really talking about a “price surface” — just like a volatility surface or the yield curve on Treasurys. And in fact there are even more dimensions to the data since it could be cut differently, and … well, I won’t say what makes for good coke.
  • data collection: Do you really believe these numbers? Some undercover cop probably solicited drugs (I didn’t read the methodology section but just guessing). Does that seem like an error-free data collection process? But the same goes for macroeconomic data, financial data from companies, and so on. It comes from somewhere, it’s not “the truth” necessarily.










Upon my return [to academia, after years of private statistical consulting], I started reading the Annals of Statistics … and was bemused. Every article started with:


Assume that the data are generated by the following model…


followed by mathematics exploring inference, hypothesis testing, and asymptotics…. I [have a] very low … opinion … of the theory published in the Annals of Statistics. [S]tatistics [is] a science that deals with data.

The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions.

In the mid-1980s … A new research community … sprang up. Their goal was predictive accuracy….. They began working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.

The advances in methodology and increases in predictive accuracy since the mid-1980s that have occurred in the research of machine learning has been phenomenal…. What has been learned? The three lessons that seem most important:

  • Rashomon: the multiplicity of good models;
  • Occam: the conflict between simplicity and accuracy;
  • Bellman: dimensionality — blessing or curse

Leo Breiman, Statistical Modeling: The Two Cultures (2001)

(which are: machine learning / artificial intelligence / algorithmists —vs— model builders / statistics / econometrics / psychometrics)




What happens if, instead of doing a linear regression with sums of monomial terms, you do the complete opposite? Instead of regressing the phenomenon against a sum of separate single-variable terms, you regress it against an explanation built from interaction terms that mix the variables together?

I first thought of this question several years ago whilst living with my sister. She’s a complex person. If I asked her how her day went, and wanted to predict her answer with an equation, I definitely couldn’t use linearly separable terms. That would mean that, if one aspect of her day went well and the other aspect went poorly, the two would even out. Not the case for her. One or two things could totally swing her day all-the-way-to-good or all-the-way-to-bad.

The pattern of her moods and emotional affect has nothing to do with irrationality or moodiness. She’s just an intricate person with a complex utility function.

If you don’t know my sister, you can pick up the point from this well-known stereotype about the difference between men and women:

“Men are simple, women are complex.” Think about a stereotypical teenage girl describing what made her upset. “It’s not any one thing, it’s everything.”

I.e., nonseparable interaction terms.

I wonder if there’s a mapping that sensibly inverts strongly-interdependent polynomials with monomials — interchanging interdependent equations with separable ones. If so, that could invert our notions of a parsimonious model.

Who says that a model that’s short to write in one particular space or parameterisation is the best one? or the simplest? Some things are better understood when you consider everything at once.
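
A small numerical sketch of why the parameterisation matters. The data-generating process is an assumption made up for illustration (an outcome driven purely by the product of two inputs); it is not a model of anyone's actual mood.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Assumed process: the outcome depends only on the interaction of the
# two inputs, plus a little noise.
y = x1 * x2 + rng.normal(scale=0.1, size=n)

def r_squared(columns, y):
    """R^2 of an ordinary least-squares fit of y on the columns plus an intercept."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1 - residual.var() / y.var()

print(f"separable terms only (x1, x2): R^2 = {r_squared([x1, x2], y):.3f}")
print(f"with interaction term (x1*x2): R^2 = {r_squared([x1, x2, x1 * x2], y):.3f}")
```

The separable model explains almost nothing here, while one nonseparable term explains nearly everything, so which model counts as "short" depends on the terms you allow yourself to write down.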




It is a common mistake of inexperienced statisticians to plunge into a complex analysis without paying attention to what the objectives are or to even whether the data are appropriate to the proposed analysis. Look before you leap!

Julian James Faraway, Linear Models with R




When people pontificate about national politics, I find the dialogue too generalistic.

These discussions ignore most of the interesting variation and lose touch with real places. And, certain facts that are obvious if you’re familiar with the more specific numbers seem “miraculous” when you just hear one nation-level statistic. (Tax statistics are one such.)

Consider the US unemployment rate, for example. Not only does that figure make it sound like the same 9.5% are unemployed — not true, it’s just an aggregate of all hirings & firings and business openings & business closings — but the unemployment rate in Dane, WI, doesn’t really affect me, because I live in Monroe, IN. If I see some really, really, really compelling place — like Travis, TX — I might uproot my entire life and thenceforth be affected by the data in Travis, TX. And a nearby, culturally good place like Louisville is relevant. I moved to Louisville for a while for a job. But mostly, I need to focus on improving the economy in Monroe, IN.

I remember very well, when I was running my first business, reading grim economic news about the rest of the country. Mall-dwelling retard businesses, national franchises leveraged on the assumption that all of their new franchisees will face good economic conditions … they were affected by the national statistics, but not me. The newspapers kept shouting about how bad things were and I didn’t see it at all.

 

I think if people were primed by reading a table like this before engaging in debates, a lot fewer overly-generalistic ideas would be floated. Looking at regional variation puts me in a frame of mind that’s more specific, more sub_i sub_j, in touch with data and out of touch with theory.

N America is too big for anyone’s imagination. Europe is too big for anyone’s imagination. Africa is too big for anyone’s imagination. China is too big for anyone’s imagination. India is too big for anyone’s imagination. Theory makes the world seem small, which is necessary to be able to comprehend huge topics. But Theory can make you overconfident. Data humble you.

The question

  • How will policy X create green jobs in Monroe County? in Travis County? in Lancaster County?

gets my gears running very differently than the question

  • “How will policy X create green jobs?”

Importantly, the first question is more bullsh~t-proof, even though logically a “Create green jobs” type of claim should be evaluated as the sum total of all green jobs created in every county.

The number after each county name is the average weekly wage, in dollars.

Table 1. Covered(1) establishments, employment, and wages in the 323 largest counties, first quarter 2011(2)

County	                        Average weekly wage
United States(6).........	935

San Juan, PR.............	598
Peoria, IL...............	944
Santa Clara, CA..........	1863
Macomb, MI...............	941
Clayton, GA..............	844
Wayne, MI................	1021
Brazoria, TX.............	922
Saginaw, MI..............	760
Stark, OH................	703
Butler, PA...............	799
New York, NY.............	2634
Hartford, CT.............	1260
Fulton, GA...............	1370
Washington, PA...........	867
Snohomish, WA............	968
Genesee, MI..............	742
Fort Bend, TX............	979
Jefferson, TX............	920
Forsyth, NC..............	891
Montgomery, TX...........	886
Hennepin, MN.............	1197
Harris, TX...............	1258
Weld, CO.................	776
Winnebago, IL............	769
Oakland, MI..............	1019
Catawba, NC..............	692
Cuyahoga, OH.............	953
Middlesex, MA............	1370
Mecklenburg, NC..........	1231
Marin, CA................	1103
San Diego, CA............	1003
Worcester, MA............	908
Anoka, MN................	829
Milwaukee, WI............	929
Douglas, CO..............	1069
San Francisco, CA........	1723
Lorain, OH...............	750
Sedgwick, KS.............	816
Caddo, LA................	736
Washington, OR...........	1120
Erie, PA.................	695
Cass, ND.................	765
Whatcom, WA..............	745
Los Angeles, CA..........	1046
Hamilton, IN.............	924
Benton, AR...............	1110
Howard, MD...............	1141
Somerset, NJ.............	1867
Bexar, TX................	838
Contra Costa, CA.........	1210
Nueces, TX...............	748
New Castle, DE...........	1194
Bristol, MA..............	791
Essex, MA................	955
Henrico, VA..............	1027
Ramsey, MN...............	1093
Dane, WI.................	878
Scott, IA................	725
Ottawa, MI...............	714
Westmoreland, PA.........	716
De Kalb, GA..............	992
Fayette, KY..............	811
Ingham, MI...............	879
Travis, TX...............	1002
Tuscaloosa, AL...........	778
Muscogee, GA.............	749
Frederick, MD............	904
Hillsborough, NH.........	975
Lucas, OH................	793
Charleston, SC...........	774
Cook, IL.................	1145
Collin, TX...............	1075
Virginia Beach City, VA..	717
Fairfield, CT............	1888
Vanderburgh, IN..........	729
Rockingham, NH...........	857
Camden, NJ...............	903
Lake, IN.................	791
St. Louis, MN............	722
King, WA.................	1185
Pulaski, AR..............	819
Oklahoma, OK.............	837
Elkhart, IN..............	698
Larimer, CO..............	795
Mercer, NJ...............	1283
Multnomah, OR............	918
Allegheny, PA............	997
Greenville, SC...........	770
Dallas, TX...............	1156
Maricopa, AZ.............	889
Sacramento, CA...........	1025
Santa Barbara, CA........	869
Tulsa, OK................	825
Kanawha, WV..............	797
Denver, CO...............	1212
Will, IL.................	793
Plymouth, MA.............	815
Suffolk, MA..............	1625
Kalamazoo, MI............	816
Jefferson, AL............	919
Ada, ID..................	773
Polk, IA.................	940
Minnehaha, SD............	748
Shelby, TN...............	915
Richmond City, VA........	1071
Calcasieu, LA............	768
Cumberland, ME...........	835
Buncombe, NC.............	676
Guilford, NC.............	802
Webb, TX.................	590
Benton, WA...............	959
Mobile, AL...............	741
New Haven, CT............	956
New London, CT...........	960
Lafayette, LA............	847
Lancaster, PA............	734
Washington, AR...........	726
Greene, MO...............	661
Yellowstone, MT..........	721
Middlesex, NJ............	1191
Erie, NY.................	794
Mahoning, OH.............	632
Dauphin, PA..............	889
Northampton, PA..........	791
Spokane, WA..............	751
Placer, CA...............	876
Hillsborough, FL.........	880
McHenry, IL..............	727
Harford, MD..............	844
Barnstable, MA...........	759
Norfolk, MA..............	1066
Essex, NJ................	1229
Broome, NY...............	703
Philadelphia, PA.........	1079
Madison, AL..............	978
Ventura, CA..............	964
Orange, FL...............	805
Palm Beach, FL...........	886
Wyandotte, KS............	826
Franklin, OH.............	920
Williamson, TN...........	1054
Galveston, TX............	827
Fairfax, VA..............	1479
Lee, FL..................	711
Shawnee, KS..............	751
Onondaga, NY.............	831
Newport News City, VA....	826
Clark, WA................	800
Pima, AZ.................	768
Kern, CA.................	790
Escambia, FL.............	690
Queens, NY...............	844
Suffolk, NY..............	972
Cumberland, NC...........	695
New Hanover, NC..........	741
Chesapeake City, VA......	724
Brown, WI................	803
Montgomery, AL...........	764
Adams, CO................	806
Collier, FL..............	767
Oneida, NY...............	708
Hamilton, OH.............	992
Luzerne, PA..............	684
Bell, TX.................	736
Chesterfield, VA.........	830
Alameda, CA..............	1183
Cobb, GA.................	962
Allen, IN................	747
Berks, PA................	780
Lexington, SC............	650
Boulder, CO..............	1050
Polk, FL.................	668
Chatham, GA..............	752
Richmond, GA.............	743
Linn, IA.................	847
Montgomery, MD...........	1311
Hinds, MS................	778
Denton, TX...............	780
Outagamie, WI............	747
Waukesha, WI.............	902
Lehigh, PA...............	879
Smith, TX................	739
Salt Lake, UT............	856
Jefferson, CO............	929
Baltimore City, MD.......	1081
Cumberland, PA...........	815
Delaware, PA.............	1003
Utah, UT.................	681
Manatee, FL..............	668
Marion, IN...............	987
Jefferson, LA............	831
Dakota, MN...............	895
St. Louis, MO............	973
Lancaster, NE............	711
Richmond, NY.............	758
Lake, OH.................	774
Norfolk City, VA.........	861
Alachua, FL..............	730
Burlington, NJ...........	957
York, PA.................	789
Fresno, CA...............	709
Sonoma, CA...............	846
Miami-Dade, FL...........	874
Gwinnett, GA.............	879
Du Page, IL..............	1076
Sangamon, IL.............	907
Jefferson, KY............	873
Kent, MI.................	792
Olmsted, MN..............	968
Washoe, NV...............	789
Monroe, NY...............	847
Clackamas, OR............	798
Lane, OR.................	672
Orange, CA...............	1035
San Bernardino, CA.......	754
Nassau, NY...............	1015
Montgomery, OH...........	782
El Paso, TX..............	626
Tarrant, TX..............	900
Riverside, CA............	748
San Joaquin, CA..........	752
Broward, FL..............	834
Ocean, NJ................	746
Bronx, NY................	818
Davidson, TN.............	927
Hidalgo, TX..............	556
Duval, FL................	891
Seminole, FL.............	735
Honolulu, HI.............	821
St. Joseph, IN...........	723
Boone, MO................	692
Douglas, NE..............	853
Passaic, NJ..............	921
Bucks, PA................	855
Richland, SC.............	794
Chittenden, VT...........	878
Orleans, LA..............	983
Knox, TN.................	750
Brazos, TX...............	659
Cameron, TX..............	546
McLennan, TX.............	727
Pierce, WA...............	821
El Paso, CO..............	812
Champaign, IL............	750
Albany, NY...............	937
Chester, PA..............	1164
Lackawanna, PA...........	665
Horry, SC................	534
Tulare, CA...............	622
Lake, FL.................	586
Marion, FL...............	614
Pasco, FL................	596
Pinellas, FL.............	765
Volusia, FL..............	629
Kane, IL.................	777
East Baton Rouge, LA.....	831
St. Louis City, MO.......	1037
Atlantic, NJ.............	772
Bergen, NJ...............	1152
Lubbock, TX..............	653
Solano, CA...............	921
Arapahoe, CO.............	1130
Monmouth, NJ.............	945
Jackson, OR..............	644
Anchorage Borough, AK....	958
Bernalillo, NM...........	781
Rockland, NY.............	991
Spartanburg, SC..........	761
Stanislaus, CA...........	748
Bibb, GA.................	699
Johnson, KS..............	955
Morris, NJ...............	1462
Washington, DC...........	1540
Sarasota, FL.............	722
Clay, MO.................	850
Weber, UT................	642
Baltimore, MD............	920
Providence, RI...........	895
Davis, UT................	704
Brevard, FL..............	801
Stearns, MN..............	700
Orange, NY...............	755
Summit, OH...............	841
Yakima, WA...............	606
Winnebago, WI............	831
San Luis Obispo, CA......	742
Santa Cruz, CA...........	814
McLean, IL...............	904
Madison, IL..............	738
Prince Georges, MD.......	933
Montgomery, PA...........	1198
Rutherford, TN...........	771
Loudoun, VA..............	1093
St. Clair, IL............	709
Union, NJ................	1199
Wake, NC.................	917
Marion, OR...............	699
Clark, NV................	790
Dutchess, NY.............	917
Kitsap, WA...............	798
Harrison, MS.............	668
Monterey, CA.............	808
San Mateo, CA............	1485
Jackson, MO..............	894
St. Charles, MO..........	744
Westchester, NY..........	1332
Prince William, VA.......	808
Washtenaw, MI............	925
Gloucester, NJ...........	766
Kings, NY................	725
Leon, FL.................	722
Hampden, MA..............	812
Thurston, WA.............	800
Arlington, VA............	1549
Butler, OH...............	781
Hamilton, TN.............	785
Durham, NC...............	1276
Hudson, NJ...............	1509
Williamson, TX...........	953
Yolo, CA.................	892
Lake, IL.................	1230
Anne Arundel, MD.........	958
Alexandria City, VA......	1226 

Data notes:

  • There’s a lot of variation in the number of counties per American state. For example, Indiana (36k sq mi) has 92 counties whilst Massachusetts (10k sq mi) has 14.
  • This covers only private employers, which skews some of the Maryland and Virginia numbers (presumably because so many people there work for government).
  • This is a look at employed people only, and it doesn’t count benefits.

Some raw-data observations:

  • Average income in New York County is $2,600/week but only about $800/week in the Bronx.
  • San Francisco and Arlington, VA are about $1,000/week less than New York County.
  • Incomes in Indianapolis (Marion County) are a joke on a national scale. Even if you include people in Carmel (Hamilton County) it’s still less than $1,000/week. I thought all of those Lilly people made a tidy bundle; I guess they’re too few to bring up the average.
  • I should ddply this data.
  • There seem to be a lot of $600s, $700s and $800s. That basically checks out with a median household income of $51k ($800/week is roughly $42k/year), although households can comprise two individual incomes.
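
For anyone who wants to try the “ddply this data” idea without R, here is a minimal sketch of the same split-apply-combine step in plain Python. The nine rows are copied from the table above; the function name `summarize_by_state` and the choice of an unweighted mean are my own assumptions for illustration. An unweighted mean of county averages is not a true state average, since the table carries no employment counts to weight by.

```python
from collections import defaultdict
from statistics import mean

# A few rows copied from the table above:
# (county, state, average weekly wage in dollars).
rows = [
    ("Suffolk", "MA", 1625),
    ("Plymouth", "MA", 815),
    ("Norfolk", "MA", 1066),
    ("Marion", "IN", 987),
    ("Allen", "IN", 747),
    ("St. Joseph", "IN", 723),
    ("Bronx", "NY", 818),
    ("Queens", "NY", 844),
    ("Kings", "NY", 725),
]

WEEKS_PER_YEAR = 52


def summarize_by_state(rows):
    """Split rows by state, then summarize each group (hypothetical helper).

    Returns a dict keyed by state with the number of counties, the
    unweighted mean weekly wage, and that mean scaled to a year.
    """
    # Split: collect the weekly wages for each state.
    groups = defaultdict(list)
    for _county, state, weekly in rows:
        groups[state].append(weekly)

    # Apply and combine: one summary record per state.
    summary = {}
    for state, wages in sorted(groups.items()):
        weekly_mean = mean(wages)
        summary[state] = {
            "counties": len(wages),
            "mean_weekly": weekly_mean,
            "mean_annual": weekly_mean * WEEKS_PER_YEAR,
        }
    return summary


if __name__ == "__main__":
    for state, stats in summarize_by_state(rows).items():
        print(
            f"{state}: {stats['counties']} counties, "
            f"${stats['mean_weekly']:.0f}/week, "
            f"${stats['mean_annual']:,.0f}/year"
        )
```

The annualized column is just the weekly figure times 52, which is the same back-of-envelope conversion used to compare the weekly numbers with the $51k median household income.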