

I feel vindicated by the Netflix Engineering team’s recent blog post explaining what they did with the results of the Netflix Prize. What they wrote confirms what I’ve been saying about recommendations, and matches my experience designing recommendation engines for clients, in several ways:

  1. Fancy ML techniques don’t matter so much. The winning BellKor’s Pragmatic Chaos team implemented an ensemble of something like 112 techniques smushed together. You know how many of those the Netflix team implemented? Exactly two: RBMs and SVD.

    If you’re a would-be internet entrepreneur and your idea relies on some ML but you can’t afford a quant to do the stuff for you, this is good news. Forget learning every cranny of research like Pseudo-Markovian Multibagged Quantile Dark Latent Forests! You can watch an hour-long video on OCW by Gilbert Strang that explains SVD, and two hour-long Google Tech Talks by Geoff Hinton on RBMs. RBMs are basically a superior subset of neural networks, with a theoretical basis for why they’re superior. SVD is a dimension-reduction technique from linear algebra. (There are many Science / Nature papers on dimension reduction in biology; if you don’t have a licence there are paper-request fora on Reddit.) A minimal toy sketch of the SVD approach appears after this list.

    Not that I don’t love reading about awesome techniques, or that something other than SVD isn’t sometimes appropriate. (In fact using the right technique on the right portion of the problem is valuable.) What Netflix people are telling us is that, in terms of a Kaggleistic one-shot on the monolithic data set, the diminishing marginal improvements to accuracy from a mega-ensemble algo don’t count as useful knowledge.


  2. Domain knowledge trumps statistical sophistication. This has always been the case in the recommendation engines I’ve done for clients. We spend most of our time trying to understand the space of your customers’ preferences — the cells, the topology, the metric, common-sense bounds, and so on. You can encode these characteristics in an object-oriented program. And (see bottom) doing so seems to improve the ML result a lot.

    Another reason you’re probably safe ignoring the bleeding edge of ML research is that most papers develop general techniques, test them on famous data sets, and don’t make use of domain-specific knowledge. You want a specific technique that’s going to work with your customers, not a no-free-lunch-but-optimal-according-to-X academic algorithm. Some Googlers did a sentiment-analysis paper on exactly this topic: all of the text analysis papers they had looked at chose not to optimise on specific characteristics (like keywords or text patterns) known to anyone familiar with restaurant-review data. They were able to achieve a superior solution to that particular problem without fancy new maths, only using common sense and exploration specific to their chosen domain (restaurant reviews).



  3. What you measure matters more than what you squeeze out of the data. The reason I don’t like* Kaggle is that it’s all about squeezing more juice out of existing data. What Netflix has come to understand is that it’s more important to phrase the question differently. The one-to-five-star paradigm is not going to accurately assess their customers’ attitudes toward movies. The similarity space is more like Dr Hinton’s ten-dimensional-library metaphor, where neighbourhood relationships don’t just run along a Dewey Decimal line but also along style, mood, season, director, actors, cinematography, and yes the “People like you” metric (“collaborative filtering”, a spangled bit of jargon).

    For Netflix, those preferences also evolve fairly quickly over time, which has to make the problem harder. If your users’ preferences drift like that too: good luck.

    John Wilder Tukey: “To statisticians, hubris should mean the kind of pride that fosters an inflated idea of one’s powers and thereby keeps one from being more than marginally helpful to others. … The feeling of ‘Give me (or more likely even, give my assistant) the data, and I will tell you what the real answer is!’ is one we must all fight against again and again, and yet again.” (via John D Cook)
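
Coming back to point 1: here is a minimal toy sketch of the SVD idea, not Netflix’s actual system. The ratings matrix below is invented, missing entries are crudely mean-filled, and a rank-2 reconstruction scores the movies each user hasn’t rated. A production recommender would treat missing data and regularisation far more carefully.

    import numpy as np

    # Toy user x movie ratings (0 = unrated); all values are invented.
    R = np.array([
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
        [0, 1, 4, 5],
    ], dtype=float)

    # Plain SVD of the mean-filled matrix, truncated to rank k.
    filled = np.where(R > 0, R, R[R > 0].mean())
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    k = 2
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # Recommend, for each user, the highest-scoring movie they haven't rated.
    for user in range(R.shape[0]):
        unrated = np.where(R[user] == 0)[0]
        best = unrated[np.argmax(approx[user, unrated])]
        print(f"user {user}: would recommend movie {best}")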

Relatedly, a friend of mine who’s doing a Ph.D. in complexity (modularity in Bayesian networks) has been reading the Kaggle fora from time to time. His observation of the Kaggle winners is that they usually win by making gross assumptions about either the generating process or the underlying domain. Basically they limit the ML search using common sense and data exploration, and that gives them a significant boost in performance (as measured by 1−AUC).
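
To make point 2 and that observation concrete, here is a hypothetical sketch of what “encoding domain knowledge” can look like in practice: hard common-sense rules that prune the candidate space before any learning algorithm ranks it. Every class name, rule, and number below is invented for illustration.

    from dataclasses import dataclass

    @dataclass
    class Movie:
        title: str
        genre: str
        runtime: int   # minutes

    # Hypothetical common-sense structure of the preference space.
    SEASONAL_MONTHS = {"holiday": {11, 12}}   # only surface holiday films in Nov-Dec
    MAX_RUNTIME_WITH_KIDS = 100               # households with kids skip long films

    def candidates(catalogue, month, has_kids):
        """Prune the catalogue with domain rules *before* any ML ranks it."""
        keep = []
        for m in catalogue:
            window = SEASONAL_MONTHS.get(m.genre)
            if window and month not in window:
                continue                       # out of season
            if has_kids and m.runtime > MAX_RUNTIME_WITH_KIDS:
                continue                       # common-sense bound, not learned
            keep.append(m)
        return keep

    films = [Movie("Elf", "holiday", 97), Movie("Solaris", "scifi", 167)]
    print([m.title for m in candidates(films, month=12, has_kids=True)])   # -> ['Elf']

Whatever model you then fit only has to rank what common sense lets through.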

* I admire @antgoldbloom for following through on his idea, and I do think Kaggle has a positive impact on the world. Which is much better than the typical “Someone should make X, that would be a great business” or, even worse but still typical, “I’ve been saying they should have that!” Still, I hold to my one point of critique: there’s no back-and-forth in Kaggle’s optimisation.




Another example of an unmeasurable distance: ★★★ movie ratings.

Movie ratings are drawn from the set {★,★★,★★★,★★★★,★★★★★} and related by the total ordering >:

  • ★★★★★ > ★★★★
  • ★★★★ > ★★★
  • ★★★ > ★★
  • ★★ > ★

and the transitivity of > gives the rest of the relations.

TWO PLUS TWO ≠ FOUR

However, ★★★★ is not the same thing as 4, because 4 comes with all the baggage of being an integer. Baggage like the usual metric whereunder

  • |4 − 2| = 2,

whereas |★★★★−★★| ≠ ★★.

If one naïvely assumed {★,★★,★★★,★★★★,★★★★★} ≅ {1,2,3,4,5}, that would mean

  • |★★★ − ★|  =  |3 − 1|  =  2  =  |5 − 3|  =  |★★★★★ − ★★★|.

Which would be wrong.
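
A small sketch of the same point in code (mine, not the post’s): an ordinal Rating type carries the total ordering above but deliberately has no subtraction, whereas coercing ratings to integers silently asserts exactly the equal spacing objected to here.

    from enum import Enum

    class Rating(Enum):
        """An ordinal star rating: ordered, but with no notion of distance."""
        ONE = 1      # ★
        TWO = 2      # ★★
        THREE = 3    # ★★★
        FOUR = 4     # ★★★★
        FIVE = 5     # ★★★★★

        def __lt__(self, other):
            # Only the total ordering from the post is defined -- no arithmetic.
            return self.value < other.value

    # The ordering (and its transitive closure) holds:
    assert Rating.ONE < Rating.TWO < Rating.THREE < Rating.FOUR < Rating.FIVE

    # Subtraction is deliberately meaningless: no __sub__ is defined, so
    # "★★★★ − ★★" raises instead of quietly pretending to equal ★★.
    try:
        Rating.FOUR - Rating.TWO
    except TypeError:
        pass

    # Coercing to integers is what smuggles in the equal-spacing assumption:
    assert abs(3 - 1) == abs(5 - 3)   # true of the integers 1..5,
                                      # asserted of viewers' tastes without evidence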

HOW I THINK PEOPLE USE ★★★ RATINGS

There’s no reason to believe that the distance between ★★★★★ and ★★★ is 2 or that it’s the same distance as between ★★★ and ★. I believe there is a wider gulf between ★★★ and ★ for most people.

It depends on the person, but I think a lot of people basically use only three, four, and five stars. Mostly they just use four and five stars, because they only watch movies they like.

Then when faced with two choices (★★★★ versus ★★★★★) they may think back to other movies they’ve rated, and wish they had a finer scale of gradation, or just something else to say about them — like in an orthogonal direction.

People do use ★★ and ★, but not very judiciously, I think. It’s kind of like the hotness scale … but that’s another topic.



WHAT’S NEXT

I actually have a long, in-depth critique of the ★★★★ system—which also suggests better ways to do surveys in general. But I’ve gone on too long already so let me just preview that critique by saying:

Bad data in, bad recommendations out. Don’t blame yourself, Netflix Prize contestants.

PS: Really wanted a subjunctive mood while writing this. Thanks a lot, English Language. Not.




The Netflix Prize was awarded to the team with the algorithm that most accurately guessed people’s movie tastes. Accurate according to a particular measure: root-mean-squared error, essentially the L₂ norm of the prediction errors.

In my opinion, that’s the wrong measure of success. Netflix selected for algorithms that predicted well across all data, penalizing large misses extra. But that’s not what makes a recommendation algorithm good.

The best algorithm, I think, should observe my tastes and recommend just one product that I’ve never heard of (or at least never tried), that I absolutely love. It’s OK if I like a movie and you show me another one by the same director — but I could have done that myself. The best algorithm would say:

You like Cowboy Bebop + Out Of Africa + Winged Migration, so you will like Seven Samurai.

Cowboy Bebop indicates that I like Asian sh*t; Out Of Africa is an old classic; Winged Migration doesn’t have a lot of talking. Put them together and you get an Asian classic without a lot of talking.

That’s just an example of a recommendation that would fit my criteria of goodness.
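
For what it’s worth, here is a toy sketch of that kind of attribute-combining recommendation. The tags and the tiny catalogue are invented; the point is only that pooling the traits of liked movies and picking the one unseen title that covers most of them is a different objective from minimising average rating error.

    # Invented tags and catalogue, purely for illustration.
    LIKED = {
        "Cowboy Bebop":     {"asian", "stylish"},
        "Out Of Africa":    {"old classic", "sweeping"},
        "Winged Migration": {"little dialogue", "visual"},
    }
    CATALOGUE = {
        "Seven Samurai":  {"asian", "old classic", "little dialogue"},
        "Another Sequel": {"stylish"},
        "Cowboy Bebop":   {"asian", "stylish"},
    }

    # Pool the traits my liked movies exhibit, then pick the single unseen
    # title that covers the most of them: one surprising headline pick.
    profile = set().union(*LIKED.values())
    unseen = {m: tags for m, tags in CATALOGUE.items() if m not in LIKED}
    best = max(unseen, key=lambda m: len(unseen[m] & profile))
    print(best)   # -> Seven Samurai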

In other words,

  1. only the "most recommended" movie matters
  2. it should blow me away
  3. it should be surprising.

RMSE fails #1 because it weights accuracy on the single most-recommended movie exactly the same as accuracy on every other prediction.
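
A small made-up numerical example of that tension: predictor A below has the lower RMSE, but predictor B correctly identifies the single movie I would rate highest, which by criteria 1–3 makes B the better recommender. All ratings are invented.

    import math

    # True ratings for five unseen movies, and two competing predictors.
    truth  = {"M1": 5.0, "M2": 2.0, "M3": 3.0, "M4": 2.0, "M5": 3.0}
    pred_a = {"M1": 3.0, "M2": 2.5, "M3": 3.4, "M4": 2.5, "M5": 3.1}  # safe, flat
    pred_b = {"M1": 5.0, "M2": 3.5, "M3": 1.5, "M4": 3.5, "M5": 1.5}  # bold

    def rmse(pred):
        return math.sqrt(sum((pred[m] - truth[m]) ** 2 for m in truth) / len(truth))

    def top_pick(pred):
        return max(pred, key=pred.get)

    print(round(rmse(pred_a), 2), top_pick(pred_a))  # 0.97 M3 -- lower error, wrong headliner
    print(round(rmse(pred_b), 2), top_pick(pred_b))  # 1.34 M1 -- higher error, right headliner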

As a result, today’s recommendation engines are conservative in the wrong ways and basically hack together machine learning fads.