Don Boudreaux discussing Armen Alchian's preference for clear prose over "mathematical pyrotechnics" reminded me of a few neural networks researchers I know. I won't name names, because it wasn't a favorable comparison. There's far too much equation-based whizz-bangery going on in some papers.
I used to think the problem was insufficient sophistication in my own math background, but I've recently heard independently from two very smart people in our Applied Math/Scientific Computing program that they also find the math in a lot of these papers to be more of an obfuscating smoke screen than a clarifying explication. If they find it hard to follow, I've got good reason to believe the problem isn't just me.
I've messed around with the problem a bit and developed a very rough approximation. Actually, I don't even feel comfortable calling this an approximation. But it is at least the scaffolding you could build an eventual solution around.
It imports the relevant data from the Census (after a lot of PitA data cleaning) and throws it up on the screen, connecting neighboring counties. Then it picks 50 random counties to be the centers of the states and runs a k-means update on them (weighted by the county populations) to iterate to a better solution. The only wrinkle to a basic clustering is that it uses the population of a state to set a z-coordinate value for each centroid when determining the distance to each point. States with more people in them are represented higher up, making them effectively smaller.
The image above is what you end up with; here's where the system started:
The ratio of the most populous state to the least in America now is 66:1. As simple as this system is, some of the runs I tested on got that ratio down to 11:1.
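For concreteness, here is a minimal sketch of the population-weighted k-means with a height offset described above. It is not the code behind these maps. The function names, the `z_scale` parameter, and the choice to set each centroid's height proportional to its state's share of the total population are all assumptions made for illustration.

```python
import random


def nearest_center(x, y, centers, heights):
    """Index of the closest centroid. Counties sit at z = 0, centroid j at z = heights[j]."""
    def dist2(j):
        cx, cy = centers[j]
        return (x - cx) ** 2 + (y - cy) ** 2 + heights[j] ** 2
    return min(range(len(centers)), key=dist2)


def weighted_kmeans(points, pops, k, z_scale=1.0, iters=50, seed=0):
    """Population-weighted k-means where populous states float higher up.

    points  : list of (x, y) county locations
    pops    : list of county populations, same order as points
    z_scale : hypothetical knob; centroid height = z_scale * share of total population
    """
    rng = random.Random(seed)
    # Start from k randomly chosen counties, all at height zero.
    centers = [points[i] for i in rng.sample(range(len(points)), k)]
    heights = [0.0] * k
    state_pop = [0.0] * k
    labels = [0] * len(points)
    total = float(sum(pops))
    for _ in range(iters):
        # Assignment step: nearest centroid in (x, y, z).
        labels = [nearest_center(x, y, centers, heights) for x, y in points]
        # Update step: population-weighted mean of each state's counties.
        state_pop = [0.0] * k
        sx = [0.0] * k
        sy = [0.0] * k
        for (x, y), p, j in zip(points, pops, labels):
            state_pop[j] += p
            sx[j] += p * x
            sy[j] += p * y
        for j in range(k):
            if state_pop[j] > 0:
                centers[j] = (sx[j] / state_pop[j], sy[j] / state_pop[j])
            # A heavier state is lifted further from the plane, so it attracts fewer counties.
            heights[j] = z_scale * state_pop[j] / total
    return labels, centers, state_pop
```

With `z_scale=0` this reduces to ordinary weighted k-means. The height has to be expressed in the same units as the county coordinates, so a useful `z_scale` depends on the map projection, and nothing here enforces that a state's counties are contiguous.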
In the sixth grade my Social Studies teacher lent me a book which had a whole section devoted to alternative maps of the US. What would it look like with a dozen states? Or 200? What if state lines were drawn according to climate and vegetation patterns, or ancestral ethnicities, or any other speculative idea? I'm fascinated by maps generally, and these were no exception.
Here's an example of a speculative map of the US by Neil Freeman that's been making the rounds in the last few days:
My first reaction to seeing this was: Cool. Now how do I automate this? Those two thoughts go together in my head a lot. Rather than just idly speculating how to get a machine to do this, I'm actually going to give it a try. Or at least give a try at a first approximation.
Freeman's method was initialized algorithmically, but a lot of the decisions were subjective. He also uses a lot of data (drainage basins, commuting patterns) that would be too much of a pain for me to track down and integrate for this sort of spare-time project.
I'm starting by doing everything on a county-by-county basis, which is a useful limit to work with. Without limiting myself to some relatively coarse-grained pre-existing boundaries the computational possibilities would be staggeringly vast, and the result would probably look too foreign anyway. Limiting things to the 3,033 counties and county-equivalents (Louisianan parishes, Alaskan boroughs, Virginian independent cities, ...) will make things more manageable.
I'll need population data for each county, which I can get from the Census Bureau. I'll also need an adjacency graph of counties, which I thought would be tricky, but it turns out the Census Bureau has me covered there too. The location of each county is given in that first link. For the final rendering I'll need the shape of each county. Location is simple, but I'll have to do some digging for shape.
Methods... methods... hmmm.
A clustering approach seems like a natural way to start. Something like k-means where you start with some arbitrary, incorrect centers for each state, and move them around until each state is approximately equally sized. That would at least give you something to refine from.
K-means is related to Learning Vector Quantization, which was an ancestor of Self-Organizing Maps, so that's another direction to explore. I'm not sure how you'd enforce the population similarity constraint off the top of my head. Or, for that matter, how you'd want to encode any of the information in feature space. (This 2011 paper by Skupin and Esperbé seems like a solid lead [pdf].) I'm not sure how an SOM would work for this but I have a strong feeling there's potential down that road.
There's definitely a greedy hill-climbing approach that could be easily implemented. Which means there's a Simulated Annealing approach as well. No way of knowing a priori how effective either would be.
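One way the greedy hill-climbing idea could look, as a rough sketch only: move a border county into a neighboring state whenever that lowers the imbalance in state populations. The objective (sum of squared deviations from the mean state population), the function names, and the dict-based adjacency format are assumptions for illustration. This version does not check that states stay contiguous.

```python
def state_totals(labels, pops, k):
    """Total population of each of the k states."""
    totals = [0] * k
    for lab, p in zip(labels, pops):
        totals[lab] += p
    return totals


def imbalance(totals):
    """Sum of squared deviations from the mean state population (assumed objective)."""
    mean = sum(totals) / len(totals)
    return sum((t - mean) ** 2 for t in totals)


def hill_climb(labels, pops, adjacency, k, max_passes=100):
    """Greedy local search over county-to-state assignments.

    labels    : list, labels[c] is the state (0..k-1) of county c
    adjacency : dict mapping county index -> list of neighboring county indices
    """
    labels = list(labels)
    totals = state_totals(labels, pops, k)
    counts = [labels.count(j) for j in range(k)]
    for _ in range(max_passes):
        improved = False
        for county, neighbors in adjacency.items():
            src = labels[county]
            if counts[src] == 1:
                continue  # never empty a state
            for nb in neighbors:
                dst = labels[nb]
                if dst == src:
                    continue
                before = imbalance(totals)
                # Tentatively move the county across the border.
                totals[src] -= pops[county]
                totals[dst] += pops[county]
                if imbalance(totals) < before:
                    labels[county] = dst
                    counts[src] -= 1
                    counts[dst] += 1
                    improved = True
                    break
                # Not an improvement: undo the move.
                totals[src] += pops[county]
                totals[dst] -= pops[county]
        if not improved:
            break  # local optimum reached
    return labels
```

A simulated annealing variant would keep the same move but sometimes accept a worsening one, with a probability that shrinks as the run goes on.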
Those techniques, and many others, will need a fitness function. It would be simple enough to write one based on the inequality in state populations. (Ratio of smallest to largest, Gini coefficient, ...) There should be a penalty for non-compact states, perhaps based on perimeter-to-area ratio. But that may cause a problem for states along coastlines or rivers. Perhaps use the adjacency graph and penalize states with large diameters, weighted by their actual distance in geographic space, relative to the number of counties. In any event, Hawaii and Alaska may need special attention for any distance-based rules.
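The two inequality measures mentioned above are easy to sketch. This is an illustration under assumptions, not a finished fitness function: the function names are made up, the ratio is written largest-to-smallest so that it matches the "66:1" style of figure, and the compactness penalty is left out entirely.

```python
def population_ratio(state_pops):
    """Largest state population divided by the smallest (1.0 means perfectly equal)."""
    smallest = min(state_pops)
    if smallest == 0:
        return float("inf")  # an empty state is maximally unequal
    return max(state_pops) / smallest


def gini(state_pops):
    """Gini coefficient of the state populations: 0 is perfect equality."""
    xs = sorted(state_pops)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on sorted values, ranks i = 1..n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    ranked = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * ranked / (n * total) - (n + 1) / n
```

The ratio only looks at the two extreme states, while the Gini coefficient responds to the whole distribution, so a search guided by one can behave quite differently from a search guided by the other.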
Genetic Algorithms might come in handy, especially if you already had a reasonable starting point to refine from. How would a chromosome be represented though? Each county is one gene, taking available values of 1-50? The state space would be huge. Something better than the naive approach would be needed.
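To put a number on "huge", here is a small sketch of the naive encoding, one gene per county with a value from 1 to 50, along with the size of the resulting search space. The constant and function names are hypothetical, and the county count is the one used earlier in this post.

```python
import math
import random

N_COUNTIES = 3033  # county count used in this post
N_STATES = 50


def random_chromosome(n_counties, n_states, rng):
    """Naive encoding: gene i holds the state number (1..n_states) of county i."""
    return [rng.randint(1, n_states) for _ in range(n_counties)]


def search_space_log10(n_counties, n_states):
    """log10 of n_states ** n_counties, the number of possible chromosomes."""
    return n_counties * math.log10(n_states)


chromosome = random_chromosome(N_COUNTIES, N_STATES, random.Random(0))
digits = search_space_log10(N_COUNTIES, N_STATES)
print(f"{len(chromosome)} genes, about 10^{digits:.0f} possible assignments")
```

Almost all of those assignments scatter each state's counties across the map, which is one reason an encoding that respects the adjacency graph would be needed.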
I'm sure there are notes in either my Computational Geometry or GIS notebooks that would come in handy. I'll have to flip through those next time I'm in the lab. Maybe my Algorithmic Game Theory notes as well; I seem to remember some relevant problems on siting cell towers that may be adaptable.
This should give me plenty to get started with. I've got a lot of other irons in the fire, but hopefully in a few days I'll be back with another post laying out my preliminary efforts.
So does Netflix have an edge? Is there any reason to think they can flourish where so many have failed? The apparent answer is data. Netflix has lots and lots of data. They know what we watch, when we watch, where we stop watching, where we repeat a scene, where we reach for the fast-forward button, and most critically, when we break off and move on. They know which movies sell well at 8:00 on a Friday night and which ones we like to watch on Sunday afternoon. They can surmise which directors, writers, and stars produce the most watchable entertainment. They have magnificent data.
(1) Yes, data is their edge. (2) They don't need to make better content than everyone else. They just need to make content good enough to give them bargaining chips when they strike deals with other content makers and distributors.
And that's a tragedy. Netflix has so much data that they are going to be tempted to climb into the creative tent and start offering "advice."
They can claim to know exactly what works and what does not. Well, sorry, no. Knowing that something works leaves us a long way from knowing why something works. And this leaves us a long way from knowing how to reproduce it in another movie. The only thing this data can be absolutely sure to produce is arrogance. We have seen this mistake before.
Yes, they can claim to know exactly what works and what does not and why. Or they could not. There's nothing inherent in a quantitative approach that rules out epistemic humility. In fact, there's much to quantitative reasoning that makes it more humble. When's the last time you saw someone run a t-test on an executive's intuitions or gut feelings?
This means that whatever the data say, Netflix cannot tell a director, "We need a fight scene here." And it really can't say, "We need a fight scene at the 14-minute mark." Doing so, will not only drive creatives away, but viewers as well. As Henry Jenkins has said, viewers are newly sophisticated and critical. They can see formula a long way off. They can see plot mechanics the second they hit the screen. And the moment this happens, they are off.
Hold on a minute. Why would you assume that Netflix's results would be more formulaic than the traditional Hollywood approach? Humans can only sort out cause and effect when there are a couple of moving pieces. Computerized pattern recognition can do so in much more complicated environments. Doesn't it stand to reason that Netflix's discovered patterns will be more complex, and therefore less formulaic and noticeable, than the patterns that intuition- and tradition-guided producers hew to?
Netflix, therefore, will have to temper their itch to intervene. Naturally, we are not talking carte blanche here. We are not saying that we take any artist and turn them loose. Because we know a great deal of capital has been squandered by creatives keen to prove how artistic and avant garde they are. No, what we need are culture producers who are — in the language of Goldilocks — "just right." They need to be able to tell a story and obey some of the story-telling conventions even as they do new and interesting things to break and bend those conventions. Only then will painters paint and patrons watch.
The advice in this post is true for every company producing creative output. It's masquerading as being specifically about Netflix. It's not only more general than it's made out to be, it's arguably less applicable to Netflix than to their competitors.
I'm also more than a little weary of critiques being made against numerical decision making without any consideration of the faults of the non-numeric decision making it's displacing.
On Friday, Netflix will release a drama expressly designed to be consumed in one sitting: “House of Cards,” a political thriller starring Kevin Spacey and Robin Wright. Rather than introducing one episode a week, as distributors have done since the days of black-and-white TVs, all 13 episodes will be streamed at the same time. “Our goal is to shut down a portion of America for a whole day,” the producer Beau Willimon said with a laugh.
Ad-financed shows — still a clear majority of viewing — may prefer to have impressions from the ads spread out over weeks and months rather than concentrated in one long marathon sitting.
On the other hand, when watching an hour-long show, or even a half-hour one, I routinely see the same ad multiple times. Not ads for the same product or service, but the very same advertisement. I am sure there is a lot of literature about the trade-off between repetition and staleness in doing this. (Note to self: ask about this at the next marketing quant lunch.)
Furthermore the show itself relies more heavily on an effective and immediate burst of concentrated marketing, with little room to build word of mouth and roll out a campaign with stages.
Yes, you lose word-of-mouth, but you also lose the inevitable week-to-week decay as people drop out of the viewership pool. Most TV shows show a remarkably consistent exponential decay in viewership. It's not at all clear to me that gaining word of mouth while losing viewers to audience decay is preferable to having neither.
This is being framed as a contest between watching 13 episodes in one day and watching them over four months. My wife and I have been watching one episode of "House of Cards" every day or so. I think this middle ground may be a better solution than either extreme. A two-week roll-out keeps viewers focused and concentrates marketing, but doesn't roll the dice on one big push.
Note that Netflix has an advantage that other outlets don't: they can continue to advertise the show for free through their service. This won't drive new members to subscribe, but I think they benefit even when existing members watch the show. True, it doesn't boost revenue, but racking up higher viewership both makes it easier for them to create high-quality shows in the future, and it strengthens their in-house productions as a bargaining chip when negotiating with other content producers and distributors, which I think is the real value of "House of Cards."
One media market which is still highly serialized and has clearly not come to grips with the implications of that is comics. Here is just one recent piece about this. People have been fretting over the serialization-vs-collection transition and the friction it causes since I started reading comics six years ago, and they don't seem any closer to resolving the tension.
PS "House of Cards" is very highly recommended. I haven't had a show I was this excited about binge-watching in a couple of years.