Posts tagged with SVD

## Rank-Nullity Theorem

The rank-nullity theorem in linear algebra says that dimensions either get

• thrown in the trash
• or show up

after the mapping.

By “the trash” I mean the origin—that black hole of linear algebra, the `/dev/null`, the ultimate crisscross paper shredder, the ashpile, the wormhole to void and cancelled oblivion; that country from whose bourn no traveller ever returns.

The way I think about rank-nullity is this. I start out with all my dimensions lined up—separated, independent, not touching each other, not mixing with each other. `|||||||||||||` like columns in an Excel table. I can think of the dimensions as separable, countable entities like this whenever it’s possible to rejigger the basis to make the dimensions linearly independent.

I prefer to always think about the linear stuff in its preferably jiggered state and treat how to do that as a separate issue.

So you’ve got your `172 row × 81 column` matrix mapping 172→ separate dimensions into →81 dimensions. I’ll also forget about the fact that some of the resultant →81 dimensions might end up as linear combinations of the input dimensions. Just pretend that each input dimension is getting its own linear λ stretch. Now linear just means multiplication.

Linear stretches λ affect the entire dimension the same. They turn a list like `[1 2 3 4 5]` into `[3 6 9 12 15]` (λ=3). It couldn’t be into `[10 20 30 − 42856712 50]` (λ=10 except not everywhere the same stretch=multiplication).

Also remember – everything has to stay centred on 0. (That’s why you always know there will be a zero subspace.) This is linear, not affine. Things stay in place and basically just stretch (or rotate).

So if my entire 18th input dimension `[… −2 −1 0 1 2 3 4 5 …]` has to get transformed the same, to `[… −2λ −λ 0 λ 2λ 3λ 4λ 5λ …]`, then linearity has simplified this large thing full of possibility and data, into something so simple I can basically treat it as a stick `|`.

If that’s the case—if I can’t put dimensions together but just have to λ stretch them or nothing, and if what happens to an element of the dimension happens to everybody in that dimension exactly equal—then of course I can’t stick all the 172→ input dimensions into the →81 dimension output space. `172−81` of them have to go in the trash. (effectively, λ=0 on those inputs)

So then the rank-nullity theorem, at least in the linear context, has turned the huge concept of dimension (try to picture 11-D space again would you mind?) into something as simple as counting to 11 `|||||||||||`.

I feel vindicated in several ways by the Netflix Engineering team’s recent blog post explaining what they did with the results of the Netflix Prize. What they wrote confirms what I’ve been saying about recommendations as well as my experience designing recommendation engines for clients, in several ways:

1. Fancy ML techniques don’t matter so much. The winning BellKor/Pragmatic Chaos teams implemented ensemble methods with something like 112 techniques smushed together. You know how many of those the Netflix team implemented? Exactly two: RBM's and SVD.

If you’re a would-be internet entrepreneur and your idea relies on some ML but you can’t afford a quant to do the stuff for you, this is good news. Forget learning every cranny of research like Pseudo-Markovian Multibagged Quantile Dark Latent Forests! You can watch an hour-long video on OCW by Gilbert Strang which explains SVD and two hour-long Google Tech Talks by Geoff Hinton on RBM’s. RBM’s are basically a superior subset of neural network with a theoretical basis why it’s superior. SVD is a dimension reduction technique from linear algebra. (There are many Science / Nature papers on dimension reduction in biology; if you don’t have a licence there are paper-request fora on Reddit.)

Not that I don’t love reading about awesome techniques, or that something other than SVD isn’t sometimes appropriate. (In fact using the right technique on the right portion of the problem is valuable.) What Netflix people are telling us is that, in terms of a Kaggleistic one-shot on the monolithic data set, the diminishing marginal improvements to accuracy from a mega-ensemble algo don’t count as useful knowledge.

2. Domain knowledge trumps statistical sophistication. This has always been the case in the recommendation engines I’ve done for clients. We spend most of our time trying to understand the space of your customers’ preferences — the cells, the topology, the metric, common-sense bounds, and so on. You can OO program these characteristics. And (see bottom) doing so seems to improve the ML result a lot.

Another reason you’re probably safe ignoring the bleeding edge of ML research is that most papers develop general techniques, test them on famous data sets, and don’t make use of domain-specific knowledge. You want a specific technique that’s going to work with your customers, not a no-free-lunch-but-optimal-according-to-X academic algorithm. Some Googlers did a sentiment-analysis paper on exactly this topic: all of the text analysis papers they had looked at chose not to optimise on specific characteristics (like keywords or text patterns) known to anyone familiar with restaurant-review data. They were able to achieve a superior solution to that particular problem without fancy new maths, only using common sense and exploration specific to their chosen domain (restaurant reviews).

3. What you measure matters more than what you squeeze out of the data. The reason I don’t like* Kaggle is that it’s all about squeezing more juice out of existing data. What Netflix has come to understand is that it’s more important to phrase the question differently. The one-to-five-star paradigm is not going to accurately assess their customers’ attitudes toward movies. The similarity space is more like Dr Hinton’s reference to a ten-dimensional library where neighbourhood relationships don’t just go along a Dewey Decimal line but also style, mood, season, director, actors, cinematography, and yes the “People like you” metric (“collaborative filtering”, a spangled bit of jargon).

For them the preferences evolve fairly quickly over time. That has to make it hard. If your users’ preferences evolve over time: good luck, it may be quite hard.

John Wilder Tukey: "To statisticians, hubris should mean the kind of pride that fosters an inflated idea of one’s powers and thereby keeps one from being more than marginally helpful to others. … The feeling of “Give me (or more likely even, give my assistant) the data, and I will tell you what the real answer is!” is one we must all fight against again and again, and yet again." via John D Cook

Relatedly, a friend of mine who’s doing a Ph.D. in complexity (modularity in Bayesian networks) has been reading the Kaggle fora from time to time. His observation of the Kaggle winners is that they usually win with gross assumptions about either the generating process or the underlying domain. Basically they limit the ML search using common sense and data exploration; that gives them a significant boost in performance (`1−AUC`).

* I admire `@antgoldbloom` for following through on his idea and I do think they have a positive impact on the world. Which is much better than the typical “Someone should make X, that would be a great business” or even worse but still typical: "I’ve been saying they should have that!” Still, I do hold to my one point of critique: there’s no back-and-forth in Kaggle’s optimisation.

## Big Data vs Quality Data

• theLoneFuturist: I'm not certain why learning Hadoop isn't more attractive to you. If you are fine with R, doesn't having lots of data interest you?
• theLoneFuturist: Don't get me wrong, there are probably unexciting tasks associated with big data, but you'd then get to run your algorithms over big data. And lack of data is an often cited problem for learning/adaptive algorithms. But of no interest to you?
• isomorphisms: The BIG DATA fad seems to be based on "let's turn a generic algorithm loose on exabytes!"
• isomorphisms: No matter how the data was gathered, what its underlying shape/logic is, what's left out.
• isomorphisms: For example twitter text analysis. At a high level I might ask "How are attitudes changing?" "How do people talk about women differently than men?" "Do attitudes toward Barack Obama depend on the state of the US economy?" Questions whose answers aren't easy to turn into just a few numbers.
• isomorphisms: My parody of a big-data faddist's response would be all the sophistication of: listen twitter | Hadoop_grep Obama | uniq -c | well_known_sentiment_analysis_algo. Hooray! Now I know how people feel about Obama! /sarcasm
• isomorphisms: In the 'modelling vs scavenging' war (cf Leo Breiman) I'm more on the modelling side. So I find some aspects of the ML / bigdata craze unsavoury.
• isomorphisms: But the emergence of petareams is certainly a paradigm shift. I don't think the Big Data faddists are wrong in that. That environmental difference will change things as surely as cheap computing power changed statistics. (Why learn statistical theory when you can bootstrap?) As far as the art of the possible -- more clickstreams being recorded makes more analysis doable.
• isomorphisms: Anyway, to answer your question, no, having a lot of data doesn't interest me.
• isomorphisms: I'd rather have interesting data than lots of it.
• theLoneFuturist: Thing is, interesting data is probably a subset of big data. Mechanically define/separate interesting and you can get it.
• isomorphisms: Definitely not, think about historical data.
• isomorphisms: For example Angus Maddison's estimates of ancient incomes; the archaeological or geological record; unscanned text (like the Book of Kells, are you going to OCR an illuminated manuscript? You would miss the Celtic knots)
• isomorphisms: Even if stuff were OCR scanned properly and no problems with tables, the interpretive work that historians do would be hard to code up in an algorithm. To me they dig up much more interesting information than the petabytes of clickstream logs.
• isomorphisms: Or these internal documents they just found from Al-Qaeda? Which would you rather have, 100 GB of server logs or 10 kB worth of text from Osama bin Laden at a crucial moment?
• isomorphisms: Also, we talk about text being "unstructured data", how about "I smell sulphur coming from over there" (during an archaeological dig) or "This kind of quartz shouldn't be at this depth in this part of the world" or, you know, "Hey look are those dinosaur footprints?"
• isomorphisms: The kind of stuff a fisherman might notice. THAT'S unstructured data.
• theLoneFuturist: Sure, though if enough historical records get scanned, they too become the dread big data. I do catch your point, though.

## Gender Wars

"There is more difference within the sexes than between them."
‒Ivy Compton-Bennett, Mother and Son

"In all of human biology, there is no greater difference than of that between men and women."
—Some biology notes I found online

These two statements sound like rhetorical opposites, but in fact both are true.

(Says me. I can’t prove this, but I bet that taking everything into consideration, divisions between men & women are greater than those between liberals & conservatives, blacks & non-blacks, tall & short, sick & well, D&D players and people who get laid, etc.)

Let me show how both statements can logically live together harmoniously.

Just like how most men are slower than female Olympians, but at the same time the average man is faster than the average woman.

NB: Not real data.

Thanks to Stats in the Wild, here’s some real data on the ages of Olympians that makes the point:

### Measurement

Even when differences are statistically significant enough to draw conclusions (such as: “boys sprint faster than girls”), the magnitude may be really small so that the difference, while indisputable, is also unimportant. (“Statistical significance” is a confusing term in this respect.)

Consider that there are many ways you could measure differences among people. Here are some that come up frequently in the gender wars:

• height, weight, curvature, angle of femur
• IQ, SAT scores, reading tests
• speed, throwing distance, fine motor skills
• communication skills, emotional intelligence
• went to college, profession is engineer
• finding things in the refrigerator, ability to focus, ability to multitask

There are many ways to measure each of these. For example, does "speed" mean in the 100m dash, 200m dash, marathon, trail running, bike race, or triathlon? While the answers wouldn’t be independent, they wouldn’t be one-to-one either.

### A billion points in a million-dimensional space

Now you are faced with 6.7 billion points in an N-dimensional space, where N is the number of things you could measure. Let’s say like a billion points in a million-dimensional space. (Some dimensions may be collinear.)

On the one hand, there are always lots of pink and blue dots mixing in with each other (e.g. men who sew better than most women)‒and directly from Ivy’s point, the distance among pinks (variation among men) is greater than the distance from the pink centroid to the blue centroid (variation between men and women).

At the same time, though, if you had to choose just one factor by which to color these dots and get maximal classification power, it would have to be gender.

In other words, gender differences may generate a maximally separating hyperplane, but Euclidean distances between differently-gendered points are often small, and Euclidean distances between same-gendered points are often large.