This is the second machine learning competition hosted at Kaggle that I've gotten serious about entering and sunk time into only to be derailed by a paper deadline. I'm pretty frustrated. Since I didn't get a chance to submit anything for the contest itself, I'm going to outline the approach I was trying here.
First a bit of background on this particular contest. The data is very high dimensional (1875 features) and multicategorical (9 classes). You get 1000 labeled training points, which isn't nearly enough to learn a good classifier on this data. In addition you get ~130000 unlabeled points. The goal is to leverage all the unlabeled data to be able to build a decent classifier out of the labeled data. To top it off you have no idea what the data represents, so it's impossible to use any domain knowledge.
I saw this contest a couple of weeks ago shortly after hearing a colleague's PhD proposal. His topic is building networks of Kohonen Self-Organizing Maps for time series data, so SOMs are where my mind went first. SOMs are a good fit for this task: they can learn on labeled or unlabeled data, and they're excellent at dimensionality reduction.
My approach was to use the unlabeled training data to learn a SOM, since they lend themselves well to unsupervised learning. Then I passed the labeled data to the SOM. The maximally active node (i.e. the node whose weight vector best matches the input vector, aka the "best matching unit" or BMU) got tagged with the class of that training sample. Then I could repeat with the test data, and read out the class(es) tagged to the BMU for each data point.
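The train, tag, and read-out steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for the contest: the function names (`bmu`, `train_som`, `tag_nodes`), the rectangular grid, the Gaussian neighborhood, the linear decay schedules, and the toy random data are all assumptions made for the example.

```python
import numpy as np

def bmu(weights, x):
    """Index of the best matching unit: the node whose weights are closest to x."""
    return int(np.argmin(np.sum((weights - x) ** 2, axis=1)))

def train_som(data, grid=(6, 6), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM on unlabeled data.

    Returns a weight matrix of shape (rows * cols, n_features).
    The learning rate and neighborhood width decay linearly (an assumption).
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Grid position of every node, used by the neighborhood function.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise each node from a randomly chosen sample.
    weights = data[rng.integers(0, len(data), rows * cols)].astype(float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.5)
            x = data[i]
            b = bmu(weights, x)
            # Gaussian neighborhood centered on the BMU, measured on the grid.
            d2 = np.sum((coords - coords[b]) ** 2, axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))
            # Pull the BMU and its neighbors toward the sample.
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def tag_nodes(weights, X_labeled, y_labeled, n_classes):
    """Count, per node, how many labeled samples of each class chose it as BMU."""
    counts = np.zeros((len(weights), n_classes), dtype=int)
    for x, y in zip(X_labeled, y_labeled):
        counts[bmu(weights, x), y] += 1
    return counts

# Toy stand-in data (random noise, purely to exercise the functions).
rng = np.random.default_rng(1)
X_unlabeled = rng.normal(size=(200, 8))
X_labeled = rng.normal(size=(30, 8))
y_labeled = rng.integers(0, 3, size=30)

som = train_som(X_unlabeled, grid=(4, 4), epochs=5)
tags = tag_nodes(som, X_labeled, y_labeled, n_classes=3)

# Read out the class tallies attached to the BMU of a new point.
x_test = rng.normal(size=8)
print(tags[bmu(som, x_test)])
```

Note that a test point's BMU may carry no tags at all, or tags from several classes, so the read-out is a tally per class and not a single label.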
So far that's simple enough, but there is far too much data to learn a SOM on efficiently,[1] so I turned to my old ensemble methods.
 SOM bagging. The most obvious approach in many ways. Train each network on only a random subset of the data. The problem here is that any reasonable fraction of the data is still too big to get into memory. (IIRC Breiman's original Bagging paper used full bootstraps, i.e. resamples the same size as the original set, and even tested using resamples larger than the original data. That's not an option for me.) I could only manage 4096 data points (a paltry 3% of the data set) in each sample without page faulting. (Keep in mind again that a big chunk of this machine's memory was being used on my actual work.)
 SOM random dendrites. Like random forests, use the whole data set but only select a subset of the features for each SOM to learn from. I could use 64 of 1875 features at a time. This is also about 3%; the standard is IIRC more like 20%.
In order to add a bit more diversity to ensemble members I trained each for a random number of epochs between 100 and 200. There are a lot of other parameters that could have been adjusted to add diversity: smoothing, distance function and size of neighborhoods, size of network, network topology, ...
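Both ensemble schemes, plus the random epoch count, come down to choosing which rows and columns each member sees. Here is a hedged sketch of that bookkeeping; `make_members` and its parameter names are hypothetical, and sampling rows with replacement (bootstrap-style) is an assumption, since the text only says "random subset".

```python
import numpy as np

def make_members(n_samples, n_features, n_members, mode, rng,
                 rows_per_member=4096, feats_per_member=64):
    """Build one spec per ensemble member: row indices, feature indices, epochs.

    mode="bagging":  a random sample of rows, all features.
    mode="subspace": all rows, a random subset of features.
    """
    members = []
    for _ in range(n_members):
        if mode == "bagging":
            # Assumption: draw rows with replacement, as in a bootstrap.
            rows = rng.choice(n_samples, size=min(rows_per_member, n_samples), replace=True)
            feats = np.arange(n_features)
        elif mode == "subspace":
            rows = np.arange(n_samples)
            # Features are drawn without replacement so none is duplicated.
            feats = np.sort(rng.choice(n_features, size=min(feats_per_member, n_features),
                                       replace=False))
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        # Random training length between 100 and 200 epochs (inclusive).
        epochs = int(rng.integers(100, 201))
        members.append({"rows": rows, "feats": feats, "epochs": epochs})
    return members

rng = np.random.default_rng(0)
bagged = make_members(130000, 1875, n_members=3, mode="bagging", rng=rng)
subspaced = make_members(130000, 1875, n_members=3, mode="subspace", rng=rng)
```

Each member would then be trained on `X[spec["rows"]][:, spec["feats"]]` for `spec["epochs"]` epochs, and the same feature indices must be applied to any point it is later asked to classify.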
This is all pretty basic. The tricky part is combining the individual SOM predictions. For starters, how should you make a prediction with a single SOM? The BMU often had several different classes associated with it. You can pick whichever class has a plurality, and give that network's vote to that class. You can assign fractions of its vote in proportion to the class ratio of the BMU. You can take into account the distance between the sample and the BMU, and incorporate the BMU's neighbors. You can use a softmax or other probabilistic process. You can weight nodes individually or weight the votes of each SOM. This weighting can be done the traditional way (e.g. based on accuracy on a validation set) or in a way that is unique to the SOM's competitive learning process (e.g. how many times was this node the BMU? what is the distance in weight-space between this node and its neighbors? how much has this node moved in the final training epochs?).
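The first two options, plurality voting and fractional voting, are easy to compare side by side. The sketch below assumes each SOM reports the class tallies tagged to its BMU for one test point; the function names, the optional per-SOM weights, and the rule that an untagged BMU abstains are illustrative choices, not something settled above.

```python
import numpy as np

def plurality_vote(bmu_counts):
    """Each SOM gives one whole vote to the most common class at its BMU.

    bmu_counts: (n_members, n_classes) class tallies at each member's BMU.
    Untagged BMUs abstain; ties go to the lowest class index (np.argmax).
    """
    counts = np.asarray(bmu_counts)
    votes = np.zeros(counts.shape[1])
    for row in counts:
        if row.sum() > 0:
            votes[np.argmax(row)] += 1.0
    return votes

def fractional_vote(bmu_counts, member_weights=None):
    """Each SOM splits its vote in proportion to the class ratio at its BMU."""
    counts = np.asarray(bmu_counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no tags contribute zeros instead of dividing by zero.
    fractions = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    if member_weights is None:
        member_weights = np.ones(len(counts))
    return (fractions * np.asarray(member_weights, dtype=float)[:, None]).sum(axis=0)

# Four SOMs, three classes; the last SOM's BMU was never tagged.
bmu_counts = [[3, 1, 0],
              [2, 2, 0],
              [0, 1, 3],
              [0, 0, 0]]

print(plurality_vote(bmu_counts))   # [2. 0. 1.]
print(fractional_vote(bmu_counts))  # [1.25 1.   0.75]
```

Both schemes pick class 0 here, but the fractional tally keeps class 1 in contention where plurality voting discards it entirely. Passing `member_weights` (for example, validation accuracy per SOM) is one way to weight the votes of each SOM.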
At some point I'm going to come back to this. I have no idea if Kaggle keeps the infrastructure set up to allow post-deadline submissions, but I hope they do. I'd like to get my score on this just to satisfy my own curiosity.
This blackbox prediction concept kept cropping up in my mind while reading Nate Silver's The Signal and the Noise. We've got all these Big Questions where we're theoretically using scientific methods to reach conclusions, and yet new evidence rarely seems to change anyone's mind.
Does Medicaid improve health outcomes? Does the minimum wage increase unemployment? Did the ARRA stimulus spending work? In theory the Baicker et al. Oregon study, Card & Krueger, and the OMB's modeling ought to cause people to update beliefs but they rarely do. Let's not even get started on the IPCC, Mann's hockey stick, etc.
So here's what I'd like to do for each of these supposedly-evidence-based-and-numerical-but-not-really issues. Assemble an expert group of econometricians, modelers, quants and so on. Give them a bunch of unlabeled data. They won't know what problem they're working on or what any of the features are. Ask them to come up with the best predictors they can.
If they determine minimum wages drive unemployment without knowing they're looking at economic data then that's good evidence the two are linked. If their solution uses Stanley Cup winners but not atmospheric CO2 levels to predict tornado frequency then that's good evidence CO2 isn't a driver of tornadoes.
I don't expect this to settle any of these questions once-and-for-all — I don't expect anything at all will do that. There are too many problems (who decides what goes in the data set or how it's cleaned or scaled or lagged?). But I think doing things double-blind like this would create a lot more confidence in econometric-style results. In a way it even lessens the data-trawling problem by stepping into the issue head-on: no more doubting how much the researchers just went fishing for any correlation they could find, because we know that's exactly what they did, so we can be fully skeptical of their results.
[1] I also ran up against computational constraints here. I'm using almost every CPU cycle (and most of the RAM) I can get my hands on to run some last-minute analysis for the aforementioned paper submission, so I didn't have a lot of resources left over to throw at this. To top it off there's a bunch of end-of-semester server maintenance going on which both took processors out of the rotation and prevented me from parallelizing this the way I wanted.