Tag Archives: computer science

Kaggle Black Box

This is the second machine learning competition hosted at Kaggle that I've gotten serious about entering and sunk time into only to be derailed by a paper deadline. I'm pretty frustrated. Since I didn't get a chance to submit anything for the contest itself, I'm going to outline the approach I was trying here.

First a bit of background on this particular contest. The data is very high dimensional (1875 features) and multicategorical (9 classes). You get 1000 labeled training points, which isn't nearly enough to learn a good classifier on this data. In addition you get ~130000 unlabeled points. The goal is to leverage all the unlabeled data to be able to build a decent classifier out of the labeled data. To top it off you have no idea what the data represents, so it's impossible to use any domain knowledge.

I saw this contest a couple of weeks ago shortly after hearing a colleague's PhD proposal. His topic is the building networks of Kohonen Self-Organizing Maps for time series data, so SOMs are where my mind went first. SOMs are a good fit for this task: they can learn on labeled or unlabeled data, and they're excellent at dimensionality reduction.

An SOM of macroeconomic features. From Sarlin, "Exploiting the self-organizing financial stability map," 2013.
An SOM of macroeconomic features. From Sarlin, "Exploiting the self-organizing financial stability map," 2013.

My approach was to use the unlabeled training data to learn a SOM, since they lend themselves well to unsupervised learning. Then I passed the labeled data to the SOM. The maximally active node (i.e. the node whose weight vector best matches the input vector, aka the "best matching unit" or BMU) got tagged with the class of that training sample. Then I could repeat with the test data, and read out the class(es) tagged to the BMU for each data point.

So far that's simple enough, but there is far too much data to learn a SOM on efficiently, ((I also ran up against computational constraints here. I'm using almost every CPU cycle (and most of the RAM) I can get my hands on to run some last-minute analysis for the aforementioned paper submission, so I didn't have a lot of resources left over to throw at this. To top it off there's a bunch of end-of-semester server maintenance going on which both took processors out of the rotation and prevented me from parallelizing this the way I wanted.)) so I turned to my old ensemble methods.

[1] SOM bagging. The most obvious approach in many ways. Train each network on only a random subset of the data. The problem here is that any reasonable fraction of the data is still too big to get into memory. (IIRC Breiman's original Bagging paper used full boostraps, i.e. resamples the same size as the original set and even tested using resamples larger than the original data. That's not an option for me.) I could only manage 4096 data points (a paltry 3% of the data set) in each sample without page faulting. (Keep in mind again that a big chunk of this machine's memory was being used on my actual work.)

[2] SOM random dendrites. Like random forests, use the whole data set but only select a subset of the features for each SOM to learn from. I could use 64 of 1985 features at a time. This is also about 3%; the standard is IIRC more like 20%.

In order to add a bit more diversity to ensemble members I trained each for a random number of epochs between 100 and 200. There are a lot of other parameters that could have been adjusted to add diversity: smoothing, distance function and size of neighborhoods, size of network, network topology, ...

This is all pretty basic. There tricky part is combining the individual SOM predictions. For starters, how should you make a prediction with a single SOM? The BMU often had several different classes associated with it. You can pick whichever class has a plurality, and give that network's vote to that class. You can assign fractions of its vote in proportion to the class ratio of the BMU. You can take into account the distance between the sample of the BMU, and incorporate the BMU's neighbors. You can use a softmax or other probabilistic process. You can weight nodes individually or weight the votes of each SOM. This weighting can be done the traditional way (e.g. based on accuracy on a validation set) or in a way that is unique to the SOM's competitive learning process (e.g. how many times was this node the BMU? what is the distance in weight-space between this node and its neighbors? how much has this node moved in the final training epochs?).

At some point I'm going to come back to this. I have no idea if Kaggle keeps the infrastructure set up to allow post-deadline submissions, but I hope they do. I'd like to get my score on this just to satisfy my own curiosity.


This blackbox prediction concept kept cropping up in my mind while reading Nate Silver's The Signal and the Noise. We've got all these Big Questions where we're theoretically using scientific methods to reach conclusions, and yet new evidence rarely seems to change anyone's mind.

Does Medicaid improve health outcomes? Does the minimum wage increase unemployment? Did the ARRA stimulus spending work? In theory the Baicker et al. Oregon study, Card & Krueger, and the OMB's modeling ought to cause people to update beliefs but they rarely do. Let's not even get started on the IPCC, Mann's hockey stick, etc.

So here's what I'd like to do for each of these supposedly-evidence-based-and-numerical-but-not-really issues. Assemble an expert group of econometricians, modelers, quants and so on. Give them a bunch of unlabeled data. They won't know what problem they're working on or what any of the features are. Ask them to come up with the best predictors they can.

If they determine minimum wages drive unemployment without knowing they're looking at economic data then that's good evidence the two are linked. If their solution uses Stanley Cup winners but not atmospheric CO2 levels to predict tornado frequency then that's good evidence CO2 isn't a driver of tornadoes.

I don't expect this to settle any of these questions once-and-for-all — I don't expect anything at all will do that. There are too many problems (who decides what goes in the data set or how it's cleaned or scaled or lagged?). But I think doing things double-blind like this would create a lot more confidence in econometric-style results. In a way it even lessens the data-trawling problem by stepping into the issue head-on: no more doubting how much the researchers just went fishing for any correlation they could find, because we know that's exactly what they did, so we can be fully skeptical of their results.

Posted in Business / Economics, CS / Science / Tech / Coding | Tagged , , | Leave a comment

Reading List for 28 May 2013

For Science!

busy_sciencing
This is me right now seemingly all the time.

Patrick Morrison & Emerson Murphy-Hill :: Is Programming Knowledge Related To Age? An Exploration of Stack Overflow [pdf]

As a CS guy who's tip-toed into psychology here and there I would offer Morrison & Murphy-Hill this advice: tread very, very lightly when making claims regarding the words "knowledge" and especially "intelligence."

Playing_forever

TetrisConcept.net :: Playing Forever

I'm glad I didn't know about this in the winter of 2003, when I engaged in intense bouts of Tetris as a weird form of post-modern zazen. I still remember the guy who used to sit in front of me in Linear Algebra wore a tattersall shirt every single class, and I would see tetrominos cascading down his back.

RWCG :: What Brown-Vitter are asking for

This is why I want legislators & regulators who have played some strategy games. I want people making rules who have the habit of thinking, "If I do this, what is the other guy going to do? Surely he won't simply keep doing the things he was doing before I changed the environment. And surely not the exact the thing that I hope he does. What if he responds by...?"

Stephen Landsburg :: Seven Trees in One

We started with a weird pseudo-equation, manipulated it as if it were meaningful, transformed it into a series of statements that were either meaningless or clearly false, and out popped something that happened to be true. What Blass essentially proved (and Fiore and Leinster generalized) is, in effect, is that this is no coincidence. More specifically, they’ve proved in a very broad context that if you manipulate this kind of equation, pretending that sets are numbers and not letting yourself get ruffled by the illegitimacy of everything you’re doing, the end result is sure to be either a) obviously false or b) true.

Scott Weingart :: Friends don’t let friends calculate p-values (without fully understanding them)

My (very briefly stated) problem with p-values is that they combine size-of-effect and effort-in-experiment into one scalar. (This has been in the news a lot lately with the Oregon Medicaid study. Was the effect of Medicaid on blood pressure, glucose levels, etc. insignificant because Medicaid doesn't help much or because the sample size was too small? Unsurprising peoples' answers to this question are perfectly correlated with all of their prior political beliefs.)

One of the pitfalls of computational modeling is that it allows researchers to just keeping churning out simulation runs until their results are "significant." Processor cycles get shoveled into the model's maw until you have enough results to make even a tiny observed effect fit in under that magical p=0.05 limit. In theory everyone knows this isn't kosher, but "in theory" only takes us so far.

Colin Eatock :: A North American's Guide to the use and abuse of the modern PhD

Eatock specifically means the use and abuse of the letters "PhD" as a postnominal, and the appellation "Doctor," not uses/abuses of doctoral programs eo ipso.

I'm not big on titles ("The rank is but the guinea's stamp / The Man's the gowd for a' that"). Once I've defended I'll probably make one hotel reservation as "Dr. Sylvester" just so I've done it and gotten it out of my system.

I am irked by people claiming that a non-medical doctorate is somehow "not real" though. "Doctor," like most words, has several meanings. What kind of semiotic/linguistic authority are they to declare which one is "real" and which isn't? Thanks, but they can leave their self-serving grammatical prescriptivism out of this.

Scott Aaronson :: D-Wave: Truth finally starts to emerge

Suppose that... it eventually becomes clear that quantum annealing can be made to work on thousands of qubits, but that it’s a dead end as far as getting a quantum speedup is concerned. Suppose the evidence piles up that simulated annealing on a conventional computer will continue to beat quantum annealing, if even the slightest effort is put into optimizing the classical annealing code. If that happens, then I predict that the very same people now hyping D-Wave will turn around and—without the slightest acknowledgment of error on their part—declare that the entire field of quantum computing has now been unmasked as a mirage, a scam, and a chimera. The same pointy-haired bosses who now flock toward quantum computing, will flock away from it just as quickly and as uncomprehendingly. Academic QC programs will be decimated, despite the slow but genuine progress that they’d been making the entire time in a “parallel universe” from D-Wave.

I think Aaronson is right to worry about that possibility. That's essentially what caused the "AI Winter." I'd hate to see that happen to QC.

Posted in Reading Lists | Tagged , | Leave a comment

Pi

The Economist :: Babbage Blog :: Humble Pi

The Raspberry Pi is the brainchild of a couple of computer scientists at Cambridge University. Back in 2006, they lamented the decline in programming skills among applicants for computer-science courses. ... Over the past ten years, computer-science students have gone from arriving at university with a knowledge of several assembly and high-level programming languages to having little more than a working knowledge of HTML, Javascript and perhaps PHP—simple tools for constructing web sites. To learn a computer language, “you’ve got to put in your 10,000 hours,” says Dr Upton. “And it’s a lot easier if you start when you’re 18.” Some would say it is even better to start at 14.

The problem is not a lack of interest, but the lack of cheap, programmable hardware for teenagers to cut their teeth on. For typical youngsters, computers have become too complicated, too difficult to open (laptops especially) and alter their settings, and way too expensive to tinker with and risk voiding their warranty by frying their innards.

I don't see the connection between learning to code and having super-cheap hardware. Back when I was a kid learning to program I actually had to pay real money for a compiler. (Anyone else remember Borland Turbo C++?) Now you're tripping over free languages and environments to use, including many that run entirely through your browser so there's zero risk to your machine.

Honestly how many teens are going to go full-David Lightman and be doing serious enough hacking that their hardware is at risk? Is the goal to make sure teens have the opportunity to start learning to code before college, or to give them hardware to tinker with? Those are both fine goals. Being a software guy I'd put more weight on the former, but the important thing is that the way to accomplish either are completely different.

The Pi is a great way to meet the goal of giving people cheap hardware to experiment with. But if the goal is to give kids an opportunity to start racking up their 10k hours in front of an interpeter or compiler then projects like Repl.it are a lot better. (Repl.it has in-browser interpreters for JavaScript, Ruby, Python, Scheme and a dozen other languages.)

For starters, [your correspondant] plans to turn his existing Raspberry Pi into a media centre. By all accounts, Raspbmc—a Linux-based operating system derived from the XBox game-player’s media centre—is a pretty powerful media player. The first task, then, is to rig the Raspberry Pi up so it can pluck video off the internet, via a nearby WiFi router, and stream it direct to a TV in the living room. Finding out not whether, but just how well, this minimalist little computer manages such a feat will be all part of the fun.

I did this exact project about a month ago, and couldn't be more pleased with either the results or the fun of getting it to work. I still have to tinker with some things: the Vimeo plugin won't log into my account, and I need to build a case. Other than that, I wish I had done this a long time ago.

Posted in CS / Science / Tech / Coding | Tagged , , , , | Leave a comment

Computational Cartography & Gerrymandering

My brief recent dabbling with computational cartography (see Part 1 & Part 2) was inspired by Neil Freeman. His project was initiated as a (speculative) reform to the Electoral College.


Incidentally, this is a great idea. Every four years when whichever party looses the White House starts complaining about how silly the Electoral College is I feel compelled to point out that it's actually an excellent way to maximize the expected impact of any one voter's vote. The catch is that this requires states to have equally-weighted populations, which obviously we don't.

I just ran a quick Monte Carlo to check the numbers on this. Let's say you voted in 10 million elections in your life, and in each one there were 100 million voters. (This is obviously way too many elections, but I wanted something that didn't require me to report the results in scientific notation.) Let's further stipulate that every one of those races was close: a dead heat with a standard deviation of 1%. In this idealized case you could expect to be the tie-breaking vote — thus having your vote "matter" — in less than four of them. Four in ten million.

cnt = 10000000;
pop = 100000000;
for i=1:1000,
  results = round(random('normal',n/2,pop*.01,[cnt 1]));
  ties=[ties;sum(results==n/2)];
end;
mean(ties)     => ans = 3.9160

boston-red-sox-world-series-trophy

If, on the other hand, there were 51 states of 1,960,784 voters each, you would be the tie-breaker for your state 203 times. This would only matter if the other 50 states were tied 25/25. You can expect this to happen 11.2% of the time. Which means your chance of being the deciding vote goes from 4-in-10,000,000 to 23-in-10,000,000 which means your vote "matters" more than five times as much in the (modified) Electoral College system than it does in the winner-take-all system.

All of this is far too numerical to explain to someone in casual conversation, so my standard defense of the Electoral College is to agree that we should abolish it as soon as the winner of the World Series is decided by whichever teams scores the most runs in 63 innings rather than the team which wins four of seven games. The reasons to prefer both the Electoral College and best-of-seven champions are the same, but I'm already way too far off track to cover that now.


Okay, back to the point. Equally-sized states would be cool. But the chances of us redrawing state boundaries is zero.

Despite this, I think an automated system to draw boundaries for equally populated regions is far more than just an exercise. Why? Because while our state boundaries are effectively carved in stone by tradition — or in the case of my home of Maryland, literally carved in stone — there are political borders that are re-drawn with regularity: congressional districts.

That's how we end up with absurdity like this:

Maryland 3rd
Maryland's (totally reasonable, not at all farcical) 3rd Congressional District

Gerrymandering was one of the chief complaints than Glenn Reynolds made in his recent EconTalk appearance:

I'm also pretty much agnostic on term limits. I used to be opposed to them. And as I see how effective gerrymandering is, which I think is one of the reasons people hate Congress but love their Congressmen, or at least tolerate their Congressmen, because of gerrymandering, I think that's something that is difficult to deal with and probably wouldn't be helped that much by term limits. If you've got a gerrymandered district, especially with modern gerrymandering technology, the new guy who gets elected in it is likely to look an awful lot like the old guy because the district is so one-dimensional.

Taking the "especially with modern gerrymandering technology" as a given, how can we use technology to pre-empt this trend? It's usually best to address technical problems with technical solutions rather than social ones.

One option is to draw the boundaries the same way we do now, but reject them if their score on some measure falls below a certain acceptable level. For example, IDV Solutions has ranked districts by perimeter/sqrt(area). This is, by their own admission, a crude way to do things. To name just one problem, it favors districts that fall on state borders like the Mason-Dixon Line over those that fall on borders like the Potomac River.

Other measures are possible, obviously. One that I think would be useful is based on nearest neighbors. In a non-gerrymandered map the vast majority of addresses will have all of their k nearest neighbors in the same district. As districts become more complex, more people will be have large proportions of their neighbors in other districts. If you limit the neighbors to those within the state, which is reasonable, you minimize the Mason-Dixon/Potomac problem.

Another measure, which I've thought out less fully, would be to measure the Kolmogorov Complexity of the map. The descriptive length of the Maryland 3rd is... well "long" doesn't cut it. Compare that to the former definition of the Maryland 1st: "the Eastern Shore up to the Susquehanna River." The Kolmogorov Complexity itself is incomputable, but I think we have very good proxies in this case. Whoever draws the maps must be able to present in writing, with no graphics, an unambiguous definition of the districts in less than k words. Or alternately, must be able to present them orally to a judge, spoken and without visual aids, in c minutes.

There's a general principle here: If you can't explain it succinctly, it's a bad rule.

Another approach might be to measure the curvature or the fractal dimension of boundaries between districts. I've thought even less about this though.

The approach of capping some metric has the option of being simple, objective, and easily implementable. Whatever partisan committee which now draws the borders presents their map to an adjudicator, who passes it through an established, preferably open-source, program which calculates the metrics. If they're above some pre-established threshold the map is rejected. This is already an improvement over the subjective way we do things now.

A second option, suggested by David Friedman, is to have the partisan committee submit not a finished map, but a program to generate a map. The inputs to the program could contain population data, municipal boundaries, etc. but no information about party affiliation or past voting records, or any proxies for them. There would also be a cap on program length to keep anyone from submitting what is essentially just a precomputed lookup table.

Given the low numeracy I see in most legislators, I think it would be a hard sell to get them produce such a program, but the idea has tremendous appeal to me. The whole problem is to remove subjectivity and make the decision on the basis only of explicitly acceptable criteria, which is exactly what computers do: they're entirely dispassionate and the only know what you tell them.

We could take this one step further and simply codify the program to be used rather than having partisan committees writing their own after every census. I find a technocratic Bureau of Districting Algorithms much easier to imagine than local politicians wrestling with GIS algorithms.

Posted in CS / Science / Tech / Coding | Tagged , | Leave a comment

Armen Alchian & Unnecessary Mathematical Fireworks

Cato Daily Podcast :: Remembering Armen Alchian

Don Boudreaux discussing Armen Alchian's preference for clear prose over "mathematical pyrotechnics" reminded me of a few neural networks researchers I know. I won't name names, because it wasn't a favorable comparison. There's far too much equation-based whizz-bangery going on in some papers.

I use to think the problem was insufficient sophistication in my own math background, but I've recently heard independently from two very smart people in our Applied Math/Scientific Computing program that they also find the math in a lot of these papers to be more of an obfuscating smoke screen than a clarifying explication. If they find it hard to follow I've got good reason to believe the problem isn't just me.

Posted in Business / Economics, CS / Science / Tech / Coding | Tagged , | Leave a comment