Big Data vs Quality Data

  • theLoneFuturist: I'm not certain why learning Hadoop isn't more attractive to you. If you are fine with R, doesn't having lots of data interest you?
  • theLoneFuturist: Don't get me wrong, there are probably unexciting tasks associated with big data, but you'd then get to run your algorithms over big data. And lack of data is an often cited problem for learning/adaptive algorithms. But of no interest to you?
  • isomorphisms: The BIG DATA fad seems to be based on "let's turn a generic algorithm loose on exabytes!"
  • isomorphisms: No matter how the data was gathered, what its underlying shape/logic is, what's left out.
  • isomorphisms: For example, Twitter text analysis. At a high level I might ask "How are attitudes changing?" "How do people talk about women differently than men?" "Do attitudes toward Barack Obama depend on the state of the US economy?" Questions whose answers aren't easy to turn into just a few numbers.
  • isomorphisms: My parody of a big-data faddist's response would have all the sophistication of: listen twitter | Hadoop_grep Obama | uniq -c | well_known_sentiment_analysis_algo. Hooray! Now I know how people feel about Obama! /sarcasm
  • isomorphisms: In the 'modelling vs scavenging' war (cf. Leo Breiman) I'm more on the modelling side. So I find some aspects of the ML / bigdata craze unsavoury.
  • isomorphisms: But the emergence of petareams is certainly a paradigm shift. I don't think the Big Data faddists are wrong in that. That environmental difference will change things as surely as cheap computing power changed statistics. (Why learn statistical theory when you can bootstrap?) As far as the art of the possible goes -- more clickstreams being recorded makes more analysis doable.
  • isomorphisms: Anyway, to answer your question, no, having a lot of data doesn't interest me.
  • isomorphisms: I'd rather have interesting data than lots of it.
  • theLoneFuturist: Thing is, interesting data is probably a subset of big data. Mechanically define/separate interesting and you can get it.
  • isomorphisms: Definitely not. Think about historical data.
  • isomorphisms: For example, Angus Maddison's estimates of ancient incomes; the archaeological or geological record; unscanned text (like the Book of Kells: are you going to OCR an illuminated manuscript? You would miss the Celtic knots).
  • isomorphisms: Even if everything were OCR-scanned properly, with no problems with tables, the interpretive work that historians do would be hard to code up in an algorithm. To me they dig up much more interesting information than the petabytes of clickstream logs.
  • isomorphisms: Or these internal documents they just found from Al-Qaeda? Which would you rather have, 100 GB of server logs or 10 kB worth of text from Osama bin Laden at a crucial moment?
  • isomorphisms: Also, we talk about text being "unstructured data". How about "I smell sulphur coming from over there" (during an archaeological dig) or "This kind of quartz shouldn't be at this depth in this part of the world" or, you know, "Hey look, are those dinosaur footprints?"
  • isomorphisms: The kind of stuff a fisherman might notice. THAT'S unstructured data.
  • theLoneFuturist: Sure, though if enough historical records get scanned, they too become the dread big data. I do catch your point, though.