Markov Chain Monte Carlo
by Andy Cooper
Posts tagged with machine learning
Michael Conover: Information Visualization for Large-Scale Data Workflows
Presented at SF Data Mining on Oct 9, 2013. The ability to instrument and interrogate data as it moves through a processing pipeline is fundamental to effecti…
I’m sorry to say I too have used the lazy robo-programmers metaphor. That was uncareful non-thinking on my part.
Trying to be more logical, what should we really conclude from the assumption that observed ↑ growth in “computer stuff” will continue apace?
Although partial least squares regression was not designed for classification and discrimination, it is … used for these purposes. For example, PLS has been used to:
- distinguish Alzheimer’s, senile dementia of the Alzheimer’s type, and vascular dementia
- discriminate between Arabica and Robusta coffee beans
- classify waste water pollution
- separate active and inactive compounds in a quantitative structure-activity relationship study
- differentiate two types of hard red wheat using near-infrared analysis
- distinguish transsexualism, borderline personality disorder and controls using a standard instrument
- determine the year of vintage port wine
- classify soy sauce by geographic region
- determine emission sources in ambient aerosol studies
Just some things that have already been attempted with statistical text-mining — from politics to Latin.
Which pair is more different?
keyboard | keyb`ard
keyboard | keybpard
keyboard | keebored
I can think of two approaches to defining distance measures between words:
These are defined in terms of how many word-processing operations are required to correct a mis-typed word.
vim keystrokes do I need?
and so on—those kinds of ideas.
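A minimal sketch of the operation-counting idea, assuming the textbook Levenshtein edit distance (each insert, delete, or substitution of one character costs one operation). The function name `edit_distance` is my own, and this is one possible cost model rather than the only one. Applied to the three pairs above:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts,
    deletes, or substitutions that turn a into b."""
    # prev[j] is the distance between the previous prefix of a and b[:j];
    # it starts as the row for the empty prefix of a.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletes
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


for typo in ["keyb`ard", "keybpard", "keebored"]:
    print(typo, edit_distance("keyboard", typo))
```

Under this count the first two pairs tie at one operation each and the third costs three, so a plain operation count cannot tell a slip onto a neighbouring key (p for o) from a slip onto a far-away key (` for o).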
If we could get conditional probabilities of various kinds of errors — like
ousfingers, or that I have to angle my hand weirdly in order to hit the previous couple strokes in some other word?
reflexible when the document topic is gymnastics?
how dp upi apwaus fomd tjos crazu stiff? That’s almost like just one error. (It’s certainly less distance from the real sentence than a random string of characters of equal length.)
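One way to make "almost like just one error" concrete is to treat a right hand sitting one key too far right as a single transformation and undo it. This is only a sketch: the QWERTY rows, the restriction to right-hand letter keys, and the names `RIGHT_HAND_ROWS`, `SHIFT_LEFT` and `unshift_right_hand` are all my assumptions.

```python
# Letter keys usually typed with the right hand on a QWERTY layout,
# listed row by row (an assumption about the layout and the typist).
RIGHT_HAND_ROWS = ["yuiop", "hjkl", "nm"]

# Map each right-hand key to the key immediately to its left in the same row.
SHIFT_LEFT = {
    row[i]: row[i - 1]
    for row in RIGHT_HAND_ROWS
    for i in range(1, len(row))
}


def unshift_right_hand(word: str) -> str:
    """Undo a right hand that sat one key too far to the right.
    Keys without an entry in SHIFT_LEFT pass through unchanged."""
    return "".join(SHIFT_LEFT.get(ch, ch) for ch in word)


for typed in ["dp", "upi", "fomd", "tjos", "crazu", "stiff"]:
    print(typed, "->", unshift_right_hand(typed))
```

The mapping recovers do, you, find, this, crazy and stuff, but it is not the whole story: "how" was typed correctly in the sample, and "apwaus" comes back as "aoways" rather than "always", so a real model would have to decide word by word whether the shift applies.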
EDIT: Once I got about halfway throguh this article, I stopped correcting my typoes, so you can see the kind that I make. I was typing on a flat keyboard, asymmetrically holding a smallish non-Mac laptop (bigger than an Eee) with my elbows out, head down — except when I type fast and interchange letters, with perfect posture, “playing the piano” with my ten finger muscles rather than moving my wrists — at an ergonomic keyboard with a broken M. I actually don’t recall which way i wrote this article. I may hav eeven written it in shifts.
Here are some nice ones as well. Look at the comments section. By the posting times (and text) you can see that the debate was feverish: no time for corrections and the correspondents were steamed up emotionally. Their typoes really have personalities. For example Kien makes a lot of errors with his right middle finger moving up (did → dud, is → us, promoted → promotied, inquisition → iquisition, mean → meaqn, Church → Chruch, because → becuase, Copernican → Ceprican, your → you, clearly → cleary) but also some errors of spelling with no sound-distance (Pythagoras → Pythagorus) and uses both the sounds disingenuous. Letter-switching, ilke I do, is common; a few fat-fingers (meaqn) or forgotten letters, but this iou stuff seems unusual and possibly characteristic of something.
Other participants make different sorts of errors, or at least with different frequencies (they’re relatively more likely to omit or switch letters than to use the wrong letter, for example). But let’s just focus on Ken because so many of the typoes are localised to that right middle finger. I wonder if Ken has a problem with that finger? Or maybe his keyboard is shaped in such a way that it’s difficult to correctly strike those keys specifically? (Maybe certain ergonomic keyboards would fit this, or an Eee PC with the elbows out and “pigeon-toed” hands. But why would the errors then be localised to the right middle finger? It’s more mobile than the pinky and ring fingers and we’re not taught to stick it to the homerow like the index finger.) I rule out the theory that his right hand hovers above the keyboard rather than sitting on the homerow because then he should make similar errors with yuiop and maybe bnm,.hjkl; as well. Also, notice that he doesn’t make comparable errors with ewr as with iou. How do we know he sits symmetrically? I have a tough time deciphering why there are more errors with that finger on a first read-through.
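To put rough numbers on "which kind of error", here is a small sketch that sorts the pairs quoted above into four single-slip categories. The function name `classify_typo` and the category labels are my own; anything involving more than one slip lands in "other", and the sketch says nothing about which finger was responsible.

```python
from collections import Counter


def classify_typo(intended: str, typed: str) -> str:
    """Label a typo as one substitution, transposition, insertion,
    or omission; anything else (including no error) is 'other'."""
    if len(intended) == len(typed):
        diffs = [i for i, (a, b) in enumerate(zip(intended, typed)) if a != b]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and intended[diffs[0]] == typed[diffs[1]]
                and intended[diffs[1]] == typed[diffs[0]]):
            return "transposition"  # two adjacent letters swapped
    elif len(typed) == len(intended) + 1:
        # One extra character: dropping it must give back the intended word.
        if any(typed[:i] + typed[i + 1:] == intended for i in range(len(typed))):
            return "insertion"
    elif len(typed) == len(intended) - 1:
        # One missing character: dropping it from the intended word gives the typo.
        if any(intended[:i] + intended[i + 1:] == typed for i in range(len(intended))):
            return "omission"
    return "other"


pairs = [("did", "dud"), ("is", "us"), ("promoted", "promotied"),
         ("inquisition", "iquisition"), ("mean", "meaqn"), ("Church", "Chruch"),
         ("because", "becuase"), ("Copernican", "Ceprican"), ("your", "you"),
         ("clearly", "cleary"), ("Pythagoras", "Pythagorus")]

counts = Counter(classify_typo(intended, typed) for intended, typed in pairs)
print(counts)
```

On these eleven pairs the tally is three substitutions, three omissions, two insertions, two transpositions and one "other" (Copernican → Ceprican). A finger-level analysis would need a key-to-finger map on top of this.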
We could find more of Ken’s writing here and see how he types when he’s less agitated. I bet there are no Ceprican's there but Pythagorus would still be. As for Chruch? Hmmm. Don’t know.
Now the big-data-ists (the other half of Leo Breiman’s partition of statistical modellers -vs- data miners) would probably say “Google has a jillion search results including measurements of people correcting themselves and including time series of the letters people type — so just throw some naive Bayes at that pile and watch it come to the correct answer!” Maybe they’re right.
If someone wants to mess around with this stuff with me — leave me a comment. We could grab tweets and analyse typoes within differnet text-…[by which tool] was used to send the tweet. For example the Twitter website means it was keyboard-typed, certain mobile devices have Swype, other errors we might be able to guess tha tis …[that it’s] a T9 mobile keyboard.
I feel vindicated in several ways by the Netflix Engineering team’s recent blog post explaining what they did with the results of the Netflix Prize. What they wrote confirms what I’ve been saying about recommendations as well as my experience designing recommendation engines for clients:
Relatedly, a friend of mine who’s doing a Ph.D. in complexity (modularity in Bayesian networks) has been reading the Kaggle fora from time to time. His observation of the Kaggle winners is that they usually win with gross assumptions about either the generating process or the underlying domain. Basically they limit the ML search using common sense and data exploration; that gives them a significant boost in performance.
* I admire @antgoldbloom for following through on his idea and I do think they have a positive impact on the world. Which is much better than the typical “Someone should make X, that would be a great business” or even worse but still typical: “I’ve been saying they should have that!” Still, I do hold to my one point of critique: there’s no back-and-forth in Kaggle’s optimisation.
Visualisation of how the kernel trick makes a non-separable collection of points linearly separable.
I guess the kernel mappings really add a dimension, rather than replacing a dimension, don’t they.
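A minimal sketch of that reading, using an explicit feature map rather than an actual kernel function (the kernel trick proper only computes inner products in the lifted space and never builds the coordinates). The ring data, the map `lift` and the threshold 4 are all made up for illustration:

```python
import math

# Eight evenly spaced angles around a circle.
ANGLES = [k * math.pi / 4 for k in range(8)]

# Two classes on concentric circles: no straight line in the (x, y) plane
# separates them, because the inner ring sits inside the outer one.
inner = [(math.cos(t), math.sin(t)) for t in ANGLES]          # radius 1
outer = [(3 * math.cos(t), 3 * math.sin(t)) for t in ANGLES]  # radius 3


def lift(point):
    """Keep x and y, and append a third coordinate z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)


# In the lifted space the flat plane z = 4 splits the two classes.
THRESHOLD = 4.0
print(all(lift(p)[2] < THRESHOLD for p in inner))
print(all(lift(p)[2] > THRESHOLD for p in outer))
```

Here x and y survive untouched and z is stacked on top, which matches the "adds a dimension" intuition for this particular map; other feature maps need not keep the original coordinates.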
Upon my return [to academia, after years of private statistical consulting], I started reading the Annals of Statistics … and was bemused. Every article started with:
Assume that the data are generated by the following model…
followed by mathematics exploring inference, hypothesis testing, and asymptotics…. I [have a] very low … opinion … of the theory published in the Annals of Statistics. [S]tatistics [is] a science that deals with data.
The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions.
In the mid-1980s … A new research community … sprang up. Their goal was predictive accuracy….. They began working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.
The advances in methodology and increases in predictive accuracy since the mid-1980s that have occurred in the research of machine learning has been phenomenal…. What has been learned? The three lessons that seem most important: