Category Archives: CS / Science / Tech / Coding

"Social networks in fact and fiction"

The Endeavour :: John Cook :: Social networks in fact and fiction

In a nutshell, the authors hope to get some insight into whether a myth is based on fact by seeing whether the social network of characters in the myth looks more like a real social network or like the social network in a work of deliberate fiction. For instance, the social networks of the Iliad and Beowulf look more like actual social networks than does the social network of Harry Potter. Real social networks follow a power law distribution more closely than do social networks in works of fiction.

If you read one version of Beowulf and it's not Seamus Heaney's translation you better have a very good reason.
Cool.

I vaguely remember reading about some astronomers using Homer's descriptions of constellations to pin down dates for various events in the Odyssey. Perhaps it was this PNAS paper by Baikouzis & Magnasco?

It seems, however, that an accurate historical account might have a suspicious social network, not because the events in it were made up but because they were filtered according to what the historian thought was important.

Indeed. Although I suspect this form of Narrative Bias would be less of a problem with Beowulf, the Iliad, etc., because the composers of those tales, and their audiences, had less exposure to the way fiction is "supposed" to be.

I would like to see someone do a similar analysis for contemporary non-fiction. I prefer authors who tell a sprawling, tangled, less narratively driven story (Keay, Mann, Lewis, and Mukherjee come to mind) to those who fit a more conventional structure. It's difficult for me to accept a story that takes place in a high-causal-density environment and yet somehow only a couple of people have any agency.

Nate Silver apparently feels the same way:

When we construct these stories, we can lose the ability to think about the evidence critically. Elections typically present compelling narratives. Whatever you thought about the politics of Barack Obama or Sarah Palin or John McCain or Hillary Clinton in 2008, they had persuasive life stories: reported books on the campaign, like Game Change, read like tightly bestselling novels.
— The Signal and the Noise, p. 59

Those are not the books I'm interested in.


NLP is very much not my area, but I can't help but wonder about automatically generating some sort of metric about this for a book. Count up how many proper names appear, and build a graph of them based on how closely together they are linked in the text. (That is, two names appearing often in the same sentence are more closely linked than two which appear in consecutive paragraphs.) Perhaps simply looking at the histogram of name occurrence frequency might give you some preliminary ability to separate books into "realistic social structures" and "fiction-like social structures."
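A minimal sketch of that metric, assuming you already have the text split into sentences and a list of character names (both hypothetical inputs here; a real version would need tokenization and proper named-entity extraction rather than naive substring matching):

n = numel(names);
counts = zeros(n, 1);               % how often each name appears
cooc   = zeros(n, n);               % how often two names share a sentence
for s = 1:numel(sentences)
  present = zeros(n, 1);
  for k = 1:n
    present(k) = ~isempty(strfind(sentences{s}, names{k}));
  end
  counts = counts + present;
  cooc   = cooc + present * present';   % bump every co-occurring pair
end
hist(counts)                        % the name-frequency histogram mentioned above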

Posted in CS / Science / Tech / Coding | Tagged , | Leave a comment

Conway's Game Of Life in APL

APL is the sort of thing I imagine "Programmer-at-Arms" Pham Nuwen used in Vernor Vinge's "Zones of Thought" books. It seems at once incomprehensibly magical and impeccably rational, forming the shortest bridge between what the programmer is thinking and what the computer is calculating.

Of course it also seems like you'd be building a bridge no one else could walk over. Not even you-one-year-from-now. I have a hard enough time figuring out what I was thinking when I read Ruby code I wrote a year ago.


For more, here's a written tutorial by the same fellow. It had this helpful postscript regarding "glider guns":

Artillery in a finite manifold is a dangerous business.

Hehe. Good advice, I suppose.

Aficionados of the 1979 arcade game "Asteroids", which took place on a torus, will recall that the manufacturers thoughtfully gave their cannon a range just short of the screen's circumference.

Found that out the hard way back in undergrad when we wrote a genetic algorithm which learned to play Asteroids.

Posted in CS / Science / Tech / Coding | Leave a comment

Manfred Mohr

I saw this post on Manfred Mohr's (very!) early digital art literally minutes after picking up a book about Mohr's work from the library yesterday. Total coincidence.

p231, Manfred Mohr

I like a lot of the pioneering digital artists' work. I think the constraints gave them a lot of focus. (Limits boost creativity.) You don't see many of those first-generation guys still producing like Mohr does, though. It's also refreshingly unusual that Mohr hasn't gotten bogged down in self-consciously retro, 8-bit, glitch kitsch.

(Dear [far too many digital artists]: I also remember that sprite animations were once a cutting-edge thing. Move on.)

Back when I was taking Computational Geometry I remember getting sidetracked working out an algorithm based on the six-dimensional rotations in Mohr's space.color.motion. At some point I need to dig my notes out for that and implement it.

If you like Mohr you may want to check out this interview as well. I like what he says at the beginning about taking a simple idea and running with it. Also the questions about an artist working like a scientist were good. This also stood out:

So I went higher and higher in dimension and it got more and more complicated. I could do different things, but I could not explain to people because it was so complicated. '6-D cube, what kind of crap are you talking about?'

I've had that experience. Here's a free tip for you: do not try to explain something to a room full of psychologists by beginning "So if you were to imagine yourself moving around on the surface of a 35-dimensional hypercube..."

This interview also displays the link between algorithmic art and political economy. Mohr sets up a system, and lets it do its thing. The artwork emerges from the software. Legislators and regulators and executives set up a system and it does its thing. The economy emerges from society. Mohr's work, like all good algorithmic art, reminds me of Hayek's description of spontaneous order: "of human action but not of human design."


PS Prosthetic Knowledge points out that Mohr also maintains a very thorough YouTube channel. Would that all digital artists did the same.

Posted in Art, CS / Science / Tech / Coding | Leave a comment

Reader

Google is shutting down the Google Reader service, as you may have heard.

I am not happy about this, but I'm also not throwing fits about it. I spent more time in my Reader tab than any other by a huge margin. I'm sure there's some other service out there that will fill the void. Google is also being pretty considerate about this: they're giving users several months warning and they've made tools available to export your subscription data.

RWCG has had a couple of posts taking the wind out of the sails of all the people who are tearing out their hair and rending their garments and casting the evil eye towards Mountain View. Overall I think he's on the right track, but this post doesn't quite line up for me:

RWCG | Crimson Reach | Google Reader: The end of the ‘free’ era

How dare Google shut down a service I made heavy use of for years and years and paid nothing whatsoever for! [...] I wonder if this event marks the end of the ‘free’ era. You know the free era, it’s the era where this was the prevailing business philosophy:

1. Drive competing services out of business with a free service (subsidized by a profitable product).
2. Cancel free service.
3. ???

Actually no, that’s still snarky. It’s more like this:

1. Company gives away something for free.
2. People like it and use it.
3. This makes people think the company is cool. Their friend.
4. When it goes public, they buy its stock, cuz it’s so cool, and everyone they know uses and likes it. Surely that’s gotta mean something, stock-wise.
5. People continue to use the free thing and come to not only rely on it but expect it as their birthright.
6. ...
7. Profit?

According to this philosophy, giving away cool stuff for free was the wave of the future. It’s what all smart, cool companies did. Only backward knuckle-dragging idiots couldn’t figure out how this added up to a business model. Economic realities were no longer reality.

I think he's short-changing the business potential of a product ("product"?) like Google Reader. There's no direct line between "make Reader & give it away free" and "profit," but this approach still has some uses.

1. Yes, it makes people think you're cool and friendly and not-evil. Many firms do things for that reason alone. They spend billions on things way outside their core competencies just so people think they're cool and friendly. Isn't that the entire point of the "Corporate Social Responsibility" fad?

Google paying its employees to create Reader is in its wheelhouse; it makes sense. Far more sense than, for example, Chrysler paying its employees to lay bathroom tile in poor neighborhoods.

2. Providing services like Reader makes Google look cool to people generally, but, more importantly, it makes them look cool to geeks. It's a punchline that a firm's number one asset is its people, but it's largely true of Google. They can do what they do because they get the pick of the litter of hackers.

I went to a career fair my CS department sponsored a month or so ago. The line for the Google table was literally out the door. Most people in my graduate program are angling for academic jobs. Google is one of maybe four private companies that people will be impressed you're interviewing with.

3. Projects like Reader not only motivate applicants, they motivate employees.

Talent and productivity are extremely unevenly distributed in coders. The best are many orders of magnitude better than the median; the bottom decile (conservatively) have negative productivity. You usually don't get the best by offering them orders of magnitude more money, you get them by giving them cool problems to work on.

If you're excited about spending some time developing X, there's a good chance Google will let you do that. (At least in comparison to if you were working at Initech.) What's more, there's a chance Google will roll out X to millions of people, like they did with Reader. I can't stress enough how big of a motivator that can be.

4. Google's strategy for a while has been that anything they do to make people want to use the internet more is good for them, because they capture a dominant slice of the ad revenue online. More people spending more time online is better for Google, period. Reader fits into that. That's not a strategy that will work forever, or for many (any?) other companies. It can also be used to justify a lot of wasted "investments." But it's also true.

5. Google lives and dies off of data. Reader could have been generating that for them. I have no idea how much they actually learned from people's reading habits, if anything, but it had the potential to be a goldmine.

If you can predict people's age, race, religion, political party and drug use only using publicly available "likes" on Facebook, what could you do with my online reading habits? (Answer: sooooo much.)


I have no idea if it was a good idea or a bad one for Google to shut off Reader. I'm skeptical, but I realize I have none of the facts. I can't imagine it would cost that much to keep it running as is, especially compared to their other projects. I'm not sure what better use they have for the resources they're redeploying. I'm curious that they didn't even try to make it ad supported. Hell, I would have even paid directly to keep using it, and I pay for approximately zero online subscriptions.

Again, I don't know what they know. But I do know that "there's no revenue from this free product; let's shut it down" should not be the beginning and end of this decision making.

Posted in Business / Economics, CS / Science / Tech / Coding | Tagged , | Leave a comment

Computational Cartography & Gerrymandering

My brief recent dabbling with computational cartography (see Part 1 & Part 2) was inspired by Neil Freeman. His project was initiated as a (speculative) reform to the Electoral College.


Incidentally, this is a great idea. Every four years, when whichever party loses the White House starts complaining about how silly the Electoral College is, I feel compelled to point out that it's actually an excellent way to maximize the expected impact of any one voter's vote. The catch is that this requires states to have equally-weighted populations, which obviously we don't.

I just ran a quick Monte Carlo to check the numbers on this. Let's say you voted in 10 million elections in your life, and in each one there were 100 million voters. (This is obviously way too many elections, but I wanted something that didn't require me to report the results in scientific notation.) Let's further stipulate that every one of those races was close: a dead heat with a standard deviation of 1%. In this idealized case you could expect to be the tie-breaking vote — thus having your vote "matter" — in less than four of them. Four in ten million.

cnt = 10000000;                % elections voted in over a lifetime
pop = 100000000;               % voters per election
ties = [];
for i = 1:1000,
  % each election is a dead heat: mean pop/2, standard deviation 1% of pop
  results = round(random('normal', pop/2, pop*.01, [cnt 1]));
  ties = [ties; sum(results == pop/2)];
end;
mean(ties)     => ans = 3.9160

Boston Red Sox World Series trophy

If, on the other hand, there were 51 states of 1,960,784 voters each, you would be the tie-breaker for your state 203 times. That would only matter if the other 50 states split 25/25, which you can expect to happen 11.2% of the time. So your chance of casting the deciding vote goes from 4-in-10,000,000 to about 23-in-10,000,000, meaning your vote "matters" more than five times as much in the (modified) Electoral College system as it does in a single national popular vote.
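Checking those numbers doesn't even need a simulation. Here's a quick analytic version under the same assumptions (using the Statistics Toolbox, like the snippet above, and approximating the chance of an exact tie by the normal density at the mean):

cnt      = 10000000;                     % lifetime elections, as above
statepop = 1960784;                      % voters per equally-populated state
sigma    = statepop * 0.01;              % same 1% standard deviation
p_tie    = normpdf(statepop/2, statepop/2, sigma);   % chance your state is exactly tied
p_split  = binopdf(25, 50, 0.5);                     % chance the other 50 states go 25/25
cnt * p_tie              % => ~203 state-level ties
cnt * p_tie * p_split    % => ~23 nationally decisive votes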

All of this is far too numerical to explain to someone in casual conversation, so my standard defense of the Electoral College is to agree that we should abolish it as soon as the winner of the World Series is decided by whichever team scores the most runs in 63 innings rather than whichever team wins four of seven games. The reasons to prefer both the Electoral College and best-of-seven champions are the same, but I'm already way too far off track to cover that now.


Okay, back to the point. Equally-sized states would be cool. But the chances of us redrawing state boundaries are zero.

Despite this, I think an automated system to draw boundaries for equally populated regions is far more than just an exercise. Why? Because while our state boundaries are effectively carved in stone by tradition — or in the case of my home of Maryland, literally carved in stone — there are political borders that are re-drawn with regularity: congressional districts.

That's how we end up with absurdity like this:

Maryland's (totally reasonable, not at all farcical) 3rd Congressional District

Gerrymandering was one of the chief complaints that Glenn Reynolds made in his recent EconTalk appearance:

I'm also pretty much agnostic on term limits. I used to be opposed to them. And as I see how effective gerrymandering is, which I think is one of the reasons people hate Congress but love their Congressmen, or at least tolerate their Congressmen, because of gerrymandering, I think that's something that is difficult to deal with and probably wouldn't be helped that much by term limits. If you've got a gerrymandered district, especially with modern gerrymandering technology, the new guy who gets elected in it is likely to look an awful lot like the old guy because the district is so one-dimensional.

Taking the "especially with modern gerrymandering technology" as a given, how can we use technology to pre-empt this trend? It's usually best to address technical problems with technical solutions rather than social ones.

One option is to draw the boundaries the same way we do now, but reject them if their score on some measure falls below a certain acceptable level. For example, IDV Solutions has ranked districts by perimeter/sqrt(area). This is, by their own admission, a crude way to do things. To name just one problem, it favors districts that fall on state borders like the Mason-Dixon Line over those that fall on borders like the Potomac River.
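For what it's worth, that score is nearly a one-liner once you have a district's boundary polygon; x and y here are hypothetical column vectors of vertex coordinates:

perim = sum(hypot(diff([x; x(1)]), diff([y; y(1)])));   % perimeter of the closed polygon
score = perim / sqrt(polyarea(x, y));                   % lower means more compact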

Other measures are possible, obviously. One that I think would be useful is based on nearest neighbors. In a non-gerrymandered map the vast majority of addresses will have all of their k nearest neighbors in the same district. As districts become more complex, more people will have large proportions of their neighbors in other districts. If you limit the neighbors to those within the state, which is reasonable, you minimize the Mason-Dixon/Potomac problem.
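A rough sketch of that measure, assuming pts is an n-by-2 matrix of address (or county centroid) coordinates and lbl holds each point's district label (both names are stand-ins):

k    = 10;
idx  = knnsearch(pts, pts, 'K', k + 1);    % Statistics Toolbox; column 1 is the point itself
idx  = idx(:, 2:end);
same = lbl(idx) == repmat(lbl, 1, k);      % does each neighbor share the point's district?
score = mean(same(:));                     % close to 1 for compact, non-gerrymandered maps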

Another measure, which I've thought out less fully, would be to measure the Kolmogorov Complexity of the map. The descriptive length of the Maryland 3rd is... well, "long" doesn't cut it. Compare that to the former definition of the Maryland 1st: "the Eastern Shore up to the Susquehanna River." The Kolmogorov Complexity itself is incomputable, but I think we have very good proxies in this case. Whoever draws the maps must be able to present in writing, with no graphics, an unambiguous definition of the districts in fewer than k words. Or alternatively, must be able to present them orally to a judge, without visual aids, in c minutes.

There's a general principle here: If you can't explain it succinctly, it's a bad rule.

Another approach might be to measure the curvature or the fractal dimension of boundaries between districts. I've thought even less about this though.

The approach of capping some metric has the advantage of being simple, objective, and easily implementable. Whatever partisan committee now draws the borders presents its map to an adjudicator, who passes it through an established, preferably open-source, program which calculates the metrics. If they're above some pre-established threshold the map is rejected. This is already an improvement over the subjective way we do things now.

A second option, suggested by David Friedman, is to have the partisan committee submit not a finished map, but a program to generate a map. The inputs to the program could contain population data, municipal boundaries, etc. but no information about party affiliation or past voting records, or any proxies for them. There would also be a cap on program length to keep anyone from submitting what is essentially just a precomputed lookup table.

Given the low numeracy I see in most legislators, I think it would be a hard sell to get them to produce such a program, but the idea has tremendous appeal to me. The whole point is to remove subjectivity and make the decision only on the basis of explicitly acceptable criteria, which is exactly what computers do: they're entirely dispassionate and they only know what you tell them.

We could take this one step further and simply codify the program to be used rather than having partisan committees writing their own after every census. I find a technocratic Bureau of Districting Algorithms much easier to imagine than local politicians wrestling with GIS algorithms.

Posted in CS / Science / Tech / Coding | Tagged , | Leave a comment

Armen Alchian & Unnecessary Mathematical Fireworks

Cato Daily Podcast :: Remembering Armen Alchian

Don Boudreaux discussing Armen Alchian's preference for clear prose over "mathematical pyrotechnics" reminded me of a few neural network researchers I know. I won't name names, because it wasn't a favorable comparison. There's far too much equation-based whizz-bangery going on in some papers.

I used to think the problem was insufficient sophistication in my own math background, but I've recently heard, independently, from two very smart people in our Applied Math/Scientific Computing program that they also find the math in a lot of these papers to be more of an obfuscating smoke screen than a clarifying explication. If they find it hard to follow, I've got good reason to believe the problem isn't just me.

Posted in Business / Economics, CS / Science / Tech / Coding | Tagged , | Leave a comment

Computational Cartography (Part 2)

In the previous post I talked a bit about how one might build a system which would algorithmically divide the US into 50 states of approximately equal populations.

I've messed around with the problem a bit and developed a very rough approximation. Actually, I don't even feel comfortable calling this an approximation. But it is at least the scaffolding you could build an eventual solution around.

Auto-sized states: after

It imports the relevant data from the Census (after a lot of PitA data cleaning) and throws it up on the screen, connecting neighboring counties. Then it picks 50 random counties to be the centers of the states and runs a k-means update on them (weighted by the county populations) to iterate to a better solution. The only wrinkle to a basic clustering is that it uses the population of a state to set a z-coordinate value for each centroid when determining the distance to each point. States with more people in them are represented higher up, making them effectively smaller.
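Roughly, the update looks like the following. This is a reconstruction rather than the actual code; xy, popc, and C are placeholder names for the county centroids, county populations, and the 50 state centers:

alpha = 1e-6;                                   % hand-tuned scale for the population "height"
n = size(xy, 1);
for it = 1:100
  D = pdist2(xy, C);                            % n-by-50 planar distances
  [~, lbl] = min(D, [], 2);                     % tentative nearest-center assignment
  z = alpha * accumarray(lbl, popc, [50 1]);    % populous states get lifted higher
  [~, lbl] = min(sqrt(D.^2 + repmat(z', n, 1).^2), [], 2);   % reassign with the height term
  for s = 1:50
    w = popc(lbl == s);
    if ~isempty(w)
      C(s, :) = sum(bsxfun(@times, xy(lbl == s, :), w), 1) / sum(w);   % weighted centroid
    end
  end
end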

The image above is what you end up with; here's where the system started:

Auto-sized states: before

The ratio of the most populous state to the least in America now is 66:1. As simple as this system is, some of the runs I tested got that ratio down to 11:1.

There are a thousand things to do to make this an actual solution to the problem of automatically re-drawing state boundaries. I need to do things from the small (use spherical geometry to accurately calculate the distance between pairs of lat/long coordinates), to the medium (link AK/HI/PR to the mainland better), to the major (balance state populations more explicitly). Nevertheless, I think this is a useful framework to build some intelligent decision making on top of.
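For the first of those, the usual fix is a haversine great-circle distance. A small sketch, with inputs in degrees and output in kilometers:

function d = haversine_km(lat1, lon1, lat2, lon2)
  R = 6371;                                     % mean Earth radius, km
  dlat = (lat2 - lat1) * pi/180;
  dlon = (lon2 - lon1) * pi/180;
  a = sin(dlat/2).^2 + cos(lat1*pi/180) .* cos(lat2*pi/180) .* sin(dlon/2).^2;
  d = 2 * R * asin(sqrt(a));
end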

Posted in CS / Science / Tech / Coding | Tagged | 1 Comment

Computational Cartography (Part 1)

In the sixth grade my Social Studies teacher lent me a book which had a whole section devoted to alternative maps of the US. What would it look like with a dozen states? Or 200? What if state lines were drawn according to climate and vegetation patterns, or ancestral ethnicities, or any other speculative idea? I'm fascinated by maps generally, and these were no exception.

Here's an example of a speculative map of the US by Neil Freeman that's been making the rounds in the last few days:

The United States redrawn as Fifty States with Equal Population
"Electoral college reform," Neil Freeman

My first reaction to seeing this was: Cool. Now how do I automate this? Those two thoughts go together in my head a lot. Rather than just idly speculating about how to get a machine to do this, I'm actually going to give it a try. Or at least give a try at a first approximation.

Freeman's method was initialized algorithmically, but a lot of the decisions were subjective. He also uses a lot of data (drainage basins, commuting patterns) that would be too much of a pain for me to track down and integrate for this sort of spare-time project.

I'm starting by doing everything on a county-by-county basis, which is a useful limit to work with. Without limiting myself to some relatively coarse-grained pre-existing boundaries the computational possibilities would be staggeringly vast, and the result would probably look too foreign anyway. Limiting things to the 3,033 counties and county-equivalents (Louisianan parishes, Alaskan boroughs, Virginian independent cities, ...) will make things more manageable.

I'll need population data for each county, which I can get from the Census Bureau. I'll also need an adjacency graph of counties, which I thought would be tricky, but it turns out the Census Bureau has me covered there too. The location of each county is given in that first link. For the final rendering I'll need the shape of each county. Location is simple, but I'll have to do some digging for shape.
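The wrangling should end up looking something like this. The file and column names are my guesses, not the Census Bureau's actual formats, and it assumes every FIPS code in the adjacency file also appears in the population table:

T   = readtable('county_population.csv');       % one row per county: fips, pop, lat, lon
adj = readtable('county_adjacency.csv');        % one row per adjacent pair: fips1, fips2
[~, i] = ismember(adj.fips1, T.fips);
[~, j] = ismember(adj.fips2, T.fips);
A = sparse(i, j, 1, height(T), height(T));      % county adjacency matrix
A = (A + A') > 0;                               % symmetrize it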

Methods... methods... hmmm.

A clustering approach seems like a natural way to start. Something like k-means where you start with some arbitrary, incorrect centers for each state, and move them around until each state is approximately equally sized. That would at least give you something to refine from.

K-means is related to Learning Vector Quantization, which is closely related to Self-Organizing Maps, so that's another direction to explore. I'm not sure, off the top of my head, how you'd enforce the population-similarity constraint. Or, for that matter, how you'd want to encode any of the information in feature space. (This 2011 paper by Skupin and Esperbé seems like a solid lead [pdf].) I'm not sure how an SOM would work for this but I have a strong feeling there's potential down that road.

There's definitely a greedy hill-climbing approach that could be easily implemented. Which means there's a Simulated Annealing approach as well. No way of knowing a priori how effective either would be.

Those techniques — and many others — will need a fitness function. It would be simple enough to write one based on the inequality in state populations. (Ratio of smallest to largest, Gini coefficient, ...) There should be a penalty for non-compact states, perhaps based on perimeter-to-area ratio. But that may cause a problem for states along coastlines or rivers. Perhaps use the adjacency graph and penalize states with large diameters, weighted by their actual distance in geographic space, relative to the number of counties. In any event, Hawaii and Alaska may need special attention for any distance-based rules.
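As a concrete starting point, here's one such fitness function. It assumes lbl assigns each county a state number from 1 to 50, popc holds county populations, and A is the adjacency matrix from the sketch above; the compactness term is just the fraction of adjacent county pairs split across states, which is cruder than the diameter idea but trivial to compute:

statepop = accumarray(lbl, popc, [50 1]);     % total population of each proposed state
f_equal  = min(statepop) / max(statepop);     % 1 means perfectly equal populations
[i, j]   = find(triu(A));                     % each adjacent county pair, counted once
f_cut    = mean(lbl(i) ~= lbl(j));            % share of county borders that cross states
fitness  = f_equal - 0.5 * f_cut;             % the 0.5 weighting is arbitrary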

Genetic Algorithms might come in handy, especially if you already had a reasonable starting point to refine from. How would a chromosome be represented though? Each county is one gene, taking available values of 1-50? The state space would be huge. Something better than the naive approach would be needed.

I'm sure there are notes in either my Computational Geometry or GIS notebooks that would come in handy. I'll have to flip through those next time I'm in the lab. Maybe my Algorithmic Game Theory notes as well; I seem to remember some relevant problems on siting cell towers that may be adaptable.

This should give me plenty to get started with. I've got a lot of other irons in the fire, but hopefully in a few days I'll be back with another post laying out my preliminary efforts.

Posted in CS / Science / Tech / Coding | Tagged | 2 Comments