Netflix, Algorithms, and Hard Working Humans

There has been some press today and yesterday surrounding Alexis Madrigal’s article in The Atlantic [“How Netflix Reverse Engineered Hollywood”] on the genres Netflix uses to classify films – not just its films, but all films and television programmes. It’s a great article, and a good example of the kinds of approaches that inform the more valuable digital humanities work being done today. You should read it.

Given my current projects, I was especially interested in Madrigal’s piece for two of its aspects, neither of which has to do with cinema. The first has to do with human minds, and the second with algorithms. [There’s a third: the cycle of interaction between these].

But first, the gist: Netflix categorizes its offerings within over 76,000 genres, which can seem oddly, even absurdly specific, e.g.: “Time Travel Movies starring William Hartnell”, “Romantic Indian Crime Dramas”, “Evil Kid Horror Movies”, “Visually-striking Goofy Action & Adventure”, or “British set in Europe Sci-Fi & Fantasy from the 1960s”. Even if someone has enjoyed a romantic Indian crime drama, it’s unlikely there are “buffs” of that particular genre, and it’s hard to imagine anyone setting out to find that exact combination of attributes in a movie.

One can see why Netflix would want to label a flick [all flicks] this way, though: without the browseable shelves of a local video store [remember those?], it needs a way of organizing titles together, and criteria for suggesting titles to viewers. Displaying such detailed metadata in a list of watching suggestions [however weird it may seem in the guise of a “genre”] can clearly and succinctly communicate relevant basic details to a potential viewer.

Madrigal noticed the weird specificity of the Netflix genres, and then noticed that he could go collect a list of these genres automatically, and came up with 76,897 genres, or “micro-genres”, which Netflix calls “altgenres”. In true DH fashion, he alternated between exploration, experimentation, and analysis until he arrived at the basic principles of the Netflix algorithms, and got a bunch of results on popular topics, actors, and directors, which he put into graphs. More on that below.
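For readers curious what “go collect a list of these genres automatically” might look like in practice, here is a minimal sketch of a genre-scraping loop. It is purely illustrative: Madrigal describes using a bot, but the URL pattern, ID range, and parsing below are my own placeholder assumptions, not his actual setup or Netflix’s real endpoints.

```python
# Illustrative sketch only: a scraping loop of the general kind a
# genre-collecting bot might use. The endpoint and ID range are
# hypothetical placeholders, not Netflix's actual URLs.
import time
import requests

GENRE_URL = "http://example.com/genre?id={}"  # hypothetical endpoint

def collect_genres(max_id=100000, delay=1.0):
    """Walk numeric genre IDs and record whatever name each page reports."""
    genres = {}
    for genre_id in range(1, max_id + 1):
        response = requests.get(GENRE_URL.format(genre_id))
        if response.ok and response.text.strip():
            # In practice you would parse the page title out of the HTML;
            # here we just store the raw text as a stand-in.
            genres[genre_id] = response.text.strip()
        time.sleep(delay)  # be polite: don't hammer the server
    return genres
```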

The first thing that struck me about the Netflix genre taxonomy was not the algorithms that built the names for the genres, but the sheer hard human work, paid for by Netflix, that went into gathering the metadata for the algorithm to work with. Madrigal reports that Netflix

paid people to watch films and tag them with all kinds of metadata. This process is so sophisticated and precise that taggers receive a 36-page training document that teaches them how to rate movies on their sexually suggestive content, goriness, romance levels, and even narrative elements like plot conclusiveness.

As Madrigal says, “They could have purely used computation. […] But they went beyond that approach to look at the content itself.” This detailed, accurate, and ultimately revealing human work should be contrasted, on several axes, with the superficially similar work done within the “topic modelling” framework in the digital humanities, discussed at some length on Language Log. To put it briefly, the expensive Netflix method enlists paid-for minds that already know something about the topic [so-called “domain experts”] to generate metadata, whereas the cheap DH method tries to get the computer to do it automatically. There may be future benefits to such DH experiments, but it’s hard to argue with the results Netflix gets from the expensive approach.

None of this is to say that Netflix could do what it does without algorithms – engineers had to write a script to sort through the data they were supplied. The algorithm itself is no big deal. Madrigal was able to reverse engineer it to something like: “Region + Adjectives + Noun Genre + Based On… + Set In… + From the… + About… + For Age X to Y”. Compared to the various Google Books algorithms, this is rudimentary. But it’s working with excellent metadata.
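To make that grammar concrete, here is a minimal sketch of how a template like Madrigal’s reconstruction could assemble a genre label from human-supplied tags. The field names are my own illustrative choices, not Netflix’s actual schema, and the real labels sometimes shuffle the word order [“set in Europe” lands before the noun genre in the example quoted above].

```python
# A minimal sketch of the reconstructed grammar:
# Region + Adjectives + Noun Genre + Based On... + Set In... + From the...
# + About... + For Age X to Y. Field names are illustrative assumptions.

def altgenre_name(tags):
    """Assemble a genre label from a dict of human-supplied tags,
    skipping any component that wasn't tagged."""
    parts = [
        tags.get("region"),          # e.g. "British"
        tags.get("adjectives"),      # e.g. "Visually-striking"
        tags.get("noun_genre"),      # e.g. "Sci-Fi & Fantasy"
        f"Based on {tags['based_on']}" if tags.get("based_on") else None,
        f"Set in {tags['set_in']}" if tags.get("set_in") else None,
        f"From the {tags['from_the']}" if tags.get("from_the") else None,
        f"About {tags['about']}" if tags.get("about") else None,
        f"For Age {tags['age_range']}" if tags.get("age_range") else None,
    ]
    return " ".join(p for p in parts if p)

# Approximates one of the labels quoted above (word order aside):
print(altgenre_name({
    "region": "British",
    "noun_genre": "Sci-Fi & Fantasy",
    "set_in": "Europe",
    "from_the": "1960s",
}))
# -> "British Sci-Fi & Fantasy Set in Europe From the 1960s"
```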

The interesting thing about the algorithm is that in addition to producing all [or most of] the results that one might expect, it also produces some unexpected ones, for necessarily unknown reasons. Madrigal highlights one of these mysteries, which centres on a mystery-show, Perry Mason. Madrigal sorted out the terms that showed up in the most categories, by type of term. For “star”, he found that:

Sitting atop the list of mostly expected Hollywood stars is Raymond Burr, who starred in the 1950s television series Perry Mason. Then, at number seven, we find Barbara Hale, who starred opposite Burr in the show.

For “director”, he found that the most-mentioned name among the categories was “Christian I. Nyby II” who “directed several Perry Mason made-for-TV movies in the 1980s.”

So that’s weird. These are no Bruce Willises (number 2 actor) or Woody Allens (number 4 director). But that’s what often happens when you design an algorithm – no matter how much you improve it, it can always give you back something weird.
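The tally behind that surprise is simple to sketch. The following is illustrative only: the “starring <Name>” pattern is an assumption about how the labels are phrased, and the sample list mixes one real label quoted above with made-up ones.

```python
# Illustrative only: count how often each star appears across genre labels,
# the kind of tally that turns up a Raymond Burr at the top of the list.
import re
from collections import Counter

def top_stars(genre_names, n=10):
    """Count 'starring <Name>' occurrences across genre labels."""
    star_counts = Counter()
    for name in genre_names:
        match = re.search(r"starring (.+)$", name, flags=re.IGNORECASE)
        if match:
            star_counts[match.group(1).strip()] += 1
    return star_counts.most_common(n)

genres = [
    "Time Travel Movies starring William Hartnell",   # quoted above
    "Gritty Courtroom Dramas starring Raymond Burr",   # made-up example
    "Suspenseful Courtroom Dramas starring Raymond Burr",  # made-up example
]
print(top_stars(genres))
# -> [('Raymond Burr', 2), ('William Hartnell', 1)]
```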

Madrigal is at his best when thinking through this fact. He starts from experience:

All 76,897 genres that my bot eventually returned were formed from these [human-evaluated] basic components. While I couldn’t understand that mass of genres, the atoms and logic that were used to create them were comprehensible. I could fully wrap my head around the Netflix system.

The data is big and human. The algorithm is small and digital. They meet, and the results are part predictable, part unpredictable. After a process of refinement, the results become good enough if you just want to use the thing, but if you want to think about it, the exceptions [if you can call Raymond Burr’s win an exception] are a site for reflection. Madrigal gets this:

To me, that’s the key step: It’s where the human intelligence of the taggers gets combined with the machine intelligence of the algorithms. There’s something in the Netflix personalized genres that I think we can tell is not fully human, but is revealing in a way that humans alone might not be.

Netflix’s guy, Todd Yellin, gets this too. He’s the one who designed the system, and presumably oversaw the data gathering, the handbook, and so on. He’s quoted as saying:

“Let me get philosophical for a minute. In a human world, life is made interesting by serendipity. […] The more complexity you add to a machine world, you’re adding serendipity that you couldn’t imagine. Perry Mason is going to happen. These ghosts in the machine are always going to be a by-product of the complexity. And sometimes we call it a bug and sometimes we call it a feature.”

Here at P&C, our current theoretical focus is “contingency” [roughly synonymous with “serendipity”, at least as we conceive it] in poetry. But the work we’ve been doing in DH has largely been fuelled by an analogous serendipity, arising from putting computer algorithms to work on poetic corpora.

As much as I like Yellin’s way of putting it, Madrigal has it somewhat differently in the Fresh Air piece that ran on the same topic. On the radio, he says,

Only some of the logic that drives these categories feels human. But perhaps that’s exactly what we like about Netflix’s recommendations. They take our taste, break it down into its constituent parts, and spit it back to us in new and revealing ways. Netflix’s strange machine wants to make us happy. And to do so, it must know us and our culture in ways that are not always obvious to humans.

Something “spit back to us in new and revealing ways” is compatible with the idea of digital method I sketched out in my article “Method as Tautology in the Digital Humanities”. But I’m not sure it’s our taste that’s being returned to us in revealing ways. That is, I’m not so convinced that this analysis shows us very much about the “American soul”, or even the culture, as Madrigal claims. The Perry Mason mystery shows as much – I don’t believe (and neither does Madrigal) that there is a phantom nation of Raymond Burr devotees haunting the digital corridors.

What’s returned to us are the ideas that went into writing the 36-page training document, combined with the logic that went into designing the algorithm. And that’s not trivial – in attempting to recreate, with computing speed and scale, a human-like genre-classification faculty, we’re confronted with results that we never realized were a logical and inevitable consequence of the way we described our own genre-classification strategy. It’s not the culture that’s revealed, exactly, but the way we understand our culture — and that’s the proper disciplinary focus of the humanities.
