Quantcast

Posts tagged with Leo Breiman

Big Data vs Quality Data

  • theLoneFuturist: I'm not certain why learning Hadoop isn't more attractive to you. If you are fine with R, doesn't having lots of data interest you?
  • theLoneFuturist: Don't get me wrong, there are probably unexciting tasks associated with big data, but you'd then get to run your algorithms over big data. And lack of data is an often cited problem for learning/adaptive algorithms. But of no interest to you?
  • isomorphisms: The BIG DATA fad seems to be based on "let's turn a generic algorithm loose on exabytes!"
  • isomorphisms: No matter how the data was gathered, what its underlying shape/logic is, what's left out.
  • isomorphisms: For example twitter text analysis. At a high level I might ask "How are attitudes changing?" "How do people talk about women differently than men?" "Do attitudes toward Barack Obama depend on the state of the US economy?" Questions whose answers aren't easy to turn into just a few numbers.
  • isomorphisms: My parody of a big-data faddist's response would be all the sophistication of: listen twitter | Hadoop_grep Obama | uniq -c | well_known_sentiment_analysis_algo. Hooray! Now I know how people feel about Obama! /sarcasm
  • isomorphisms: In the 'modelling vs scavenging' war (cf Leo Breiman) I'm more on the modelling side. So I find some aspects of the ML / bigdata craze unsavoury.
  • isomorphisms: But the emergence of petareams is certainly a paradigm shift. I don't think the Big Data faddists are wrong in that. That environmental difference will change things as surely as cheap computing power changed statistics. (Why learn statistical theory when you can bootstrap?) As far as the art of the possible -- more clickstreams being recorded makes more analysis doable.
  • isomorphisms: Anyway, to answer your question, no, having a lot of data doesn't interest me.
  • isomorphisms: I'd rather have interesting data than lots of it.
  • theLoneFuturist: Thing is, interesting data is probably a subset of big data. Mechanically define/separate interesting and you can get it.
  • isomorphisms: Definitely not, think about historical data.
  • isomorphisms: For example Angus Maddison's estimates of ancient incomes; the archaeological or geological record; unscanned text (like the Book of Kells, are you going to OCR an illuminated manuscript? You would miss the Celtic knots)
  • isomorphisms: Even if stuff were OCR scanned properly and no problems with tables, the interpretive work that historians do would be hard to code up in an algorithm. To me they dig up much more interesting information than the petabytes of clickstream logs.
  • isomorphisms: Or these internal documents they just found from Al-Qaeda? Which would you rather have, 100 GB of server logs or 10 kB worth of text from Osama bin Laden at a crucial moment?
  • isomorphisms: Also, we talk about text being "unstructured data", how about "I smell sulphur coming from over there" (during an archaeological dig) or "This kind of quartz shouldn't be at this depth in this part of the world" or, you know, "Hey look are those dinosaur footprints?"
  • isomorphisms: The kind of stuff a fisherman might notice. THAT'S unstructured data.
  • theLoneFuturist: Sure, though if enough historical records get scanned, they too become the dread big data. I do catch your point, though.




Upon my return [to academia, after years of private statistical consulting], I started reading the Annals of Statistics … and was bemused. Every article started with:


Assume that the data are generated by the following model…


followed by mathematics exploring inference, hypothesis testing, and asymptotics…. I [have a] very low … opinion … of the theory published in the Annals of Statistics. [S]tatistics [is] a science that deals with data.

The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions.

In the mid-1980s … A new research community … sprang up. Their goal was predictive accuracy….. They began working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.

The advances in methodology and increases in predictive accuracy since the mid-1980s that have occurred in the research of machine learning has been phenomenal…. What has been learned? The three lessons that seem most important:

  • Rashomon: the multiplicity of good models;
  •           • Occam: the conflict between simplicity and accuracy;
  •           • Bellman: dimensionality — blessing or curse

Leo Breiman, The Two Cultures of Statistics (2001)

(which are: machine learning / artificial intelligence / algorithmists —vs— model builders / statistics / econometrics / psychometrics)