Monthly Archives: May 2013

Kaggle Black Box

This is the second machine learning competition hosted at Kaggle that I've gotten serious about entering and sunk time into only to be derailed by a paper deadline. I'm pretty frustrated. Since I didn't get a chance to submit anything for the contest itself, I'm going to outline the approach I was trying here.

First a bit of background on this particular contest. The data is very high dimensional (1875 features) and multicategorical (9 classes). You get 1000 labeled training points, which isn't nearly enough to learn a good classifier on this data. In addition you get ~130000 unlabeled points. The goal is to leverage all the unlabeled data to be able to build a decent classifier out of the labeled data. To top it off you have no idea what the data represents, so it's impossible to use any domain knowledge.

I saw this contest a couple of weeks ago shortly after hearing a colleague's PhD proposal. His topic is building networks of Kohonen Self-Organizing Maps for time series data, so SOMs are where my mind went first. SOMs are a good fit for this task: they can learn on labeled or unlabeled data, and they're excellent at dimensionality reduction.

An SOM of macroeconomic features. From Sarlin, "Exploiting the self-organizing financial stability map," 2013.

My approach was to use the unlabeled training data to learn a SOM, since they lend themselves well to unsupervised learning. Then I passed the labeled data to the SOM. The maximally active node (i.e. the node whose weight vector best matches the input vector, aka the "best matching unit" or BMU) got tagged with the class of that training sample. Then I could repeat with the test data, and read out the class(es) tagged to the BMU for each data point.
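To make the tagging step concrete, here is a minimal sketch of the idea in plain Python. It is not my contest code: the function names (`train_som`, `tag_nodes`, `predict`), the linear decay schedules, and the Gaussian neighborhood are all assumptions picked to keep the example short.

```python
import math
import random
from collections import Counter, defaultdict


def bmu_index(weights, x):
    """Index of the node whose weight vector is closest to x (Euclidean)."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a small rectangular SOM on unlabeled data (illustrative only)."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialise each node at a randomly chosen data point.
    weights = [list(rng.choice(data)) for _ in grid]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighborhood shrinks, floor 0.5
        for x in rng.sample(data, len(data)):    # one shuffled pass per epoch
            b = bmu_index(weights, x)
            for i, pos in enumerate(grid):
                d2 = (pos[0] - grid[b][0]) ** 2 + (pos[1] - grid[b][1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma ** 2))
                weights[i] = [w + lr * h * (xi - w) for w, xi in zip(weights[i], x)]
    return weights


def tag_nodes(weights, labeled):
    """Tag each BMU with the classes of the labeled samples that land on it."""
    tags = defaultdict(Counter)
    for x, y in labeled:
        tags[bmu_index(weights, x)][y] += 1
    return tags


def predict(weights, tags, x):
    """Class counts tagged to x's BMU (empty Counter if the node was never tagged)."""
    return tags.get(bmu_index(weights, x), Counter())
```

Note that `predict` returns a bag of class counts rather than a single label, because a BMU can easily end up tagged with more than one class.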

So far that's simple enough, but there is far too much data to learn a SOM on efficiently (see footnote 1), so I turned to my old ensemble methods.

[1] SOM bagging. The most obvious approach in many ways. Train each network on only a random subset of the data. The problem here is that any reasonable fraction of the data is still too big to get into memory. (IIRC Breiman's original Bagging paper used full bootstraps, i.e. resamples the same size as the original set, and even tested using resamples larger than the original data. That's not an option for me.) I could only manage 4096 data points (a paltry 3% of the data set) in each sample without page faulting. (Keep in mind again that a big chunk of this machine's memory was being used on my actual work.)

[2] SOM random dendrites. Like random forests, use the whole data set but only select a subset of the features for each SOM to learn from. I could use 64 of 1875 features at a time. This is also about 3%; the standard is IIRC more like 20%.
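The two sampling schemes differ only in which axis of the data gets subsampled. A rough sketch, assuming bootstrap-style sampling with replacement for the rows and sampling without replacement for the features (both helper names are made up for illustration, and the sizes are the ones quoted above, with 1875 features as in the contest description):

```python
import random


def bagging_sample(n_points, sample_size, rng):
    """Row indices for one ensemble member, drawn with replacement."""
    return [rng.randrange(n_points) for _ in range(sample_size)]


def feature_subset(n_features, subset_size, rng):
    """Feature (column) indices for one ensemble member, without replacement."""
    return sorted(rng.sample(range(n_features), subset_size))


rng = random.Random(0)
rows = bagging_sample(130_000, 4096, rng)  # roughly 3% of the unlabeled points
cols = feature_subset(1875, 64, rng)       # roughly 3% of the features
print(f"{len(rows) / 130_000:.1%} of rows, {len(cols) / 1875:.1%} of features")
```

Each ensemble member would get its own `rows` (bagging) or its own `cols` (random dendrites), drawn fresh from the same generator.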

In order to add a bit more diversity to ensemble members I trained each for a random number of epochs between 100 and 200. There are a lot of other parameters that could have been adjusted to add diversity: smoothing, distance function and size of neighborhoods, size of network, network topology, ...

This is all pretty basic. The tricky part is combining the individual SOM predictions. For starters, how should you make a prediction with a single SOM? The BMU often had several different classes associated with it. You can pick whichever class has a plurality, and give that network's vote to that class. You can assign fractions of its vote in proportion to the class ratio of the BMU. You can take into account the distance between the sample and the BMU, and incorporate the BMU's neighbors. You can use a softmax or other probabilistic process. You can weight nodes individually or weight the votes of each SOM. This weighting can be done the traditional way (e.g. based on accuracy on a validation set) or in a way that is unique to the SOM's competitive learning process (e.g. how many times was this node the BMU? what is the distance in weight-space between this node and its neighbors? how much has this node moved in the final training epochs?).
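The first two options are easy to compare side by side. This is only a sketch: it assumes each SOM hands back a `Counter` of the class labels tagged to its BMU, and the class numbers in the example are invented.

```python
from collections import Counter


def plurality_vote(bmu_counts):
    """Each SOM gives one whole vote to the plurality class at its BMU."""
    votes = Counter()
    for counts in bmu_counts:
        if counts:  # skip SOMs whose BMU was never tagged
            votes[counts.most_common(1)[0][0]] += 1
    return votes


def fractional_vote(bmu_counts):
    """Each SOM splits one vote in proportion to the class ratio at its BMU."""
    votes = Counter()
    for counts in bmu_counts:
        total = sum(counts.values())
        for cls, n in counts.items():
            votes[cls] += n / total
    return votes


# Three hypothetical SOMs, each reporting the class counts at its BMU.
som_outputs = [
    Counter({3: 4, 7: 3}),
    Counter({7: 5}),
    Counter({3: 3, 7: 2}),
]
print(plurality_vote(som_outputs).most_common(1)[0][0])   # class 3
print(fractional_vote(som_outputs).most_common(1)[0][0])  # class 7
```

The same three SOMs pick different winners under the two rules, which is exactly why the combination step deserves more thought than the training step.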

At some point I'm going to come back to this. I have no idea if Kaggle keeps the infrastructure set up to allow post-deadline submissions, but I hope they do. I'd like to get my score on this just to satisfy my own curiosity.

This blackbox prediction concept kept cropping up in my mind while reading Nate Silver's The Signal and the Noise. We've got all these Big Questions where we're theoretically using scientific methods to reach conclusions, and yet new evidence rarely seems to change anyone's mind.

Does Medicaid improve health outcomes? Does the minimum wage increase unemployment? Did the ARRA stimulus spending work? In theory the Baicker et al. Oregon study, Card & Krueger, and the OMB's modeling ought to cause people to update beliefs but they rarely do. Let's not even get started on the IPCC, Mann's hockey stick, etc.

So here's what I'd like to do for each of these supposedly-evidence-based-and-numerical-but-not-really issues. Assemble an expert group of econometricians, modelers, quants and so on. Give them a bunch of unlabeled data. They won't know what problem they're working on or what any of the features are. Ask them to come up with the best predictors they can.

If they determine minimum wages drive unemployment without knowing they're looking at economic data then that's good evidence the two are linked. If their solution uses Stanley Cup winners but not atmospheric CO2 levels to predict tornado frequency then that's good evidence CO2 isn't a driver of tornadoes.

I don't expect this to settle any of these questions once-and-for-all — I don't expect anything at all will do that. There are too many problems (who decides what goes in the data set or how it's cleaned or scaled or lagged?). But I think doing things double-blind like this would create a lot more confidence in econometric-style results. In a way it even lessens the data-trawling problem by stepping into the issue head-on: no more doubting how much the researchers just went fishing for any correlation they could find, because we know that's exactly what they did, so we can be fully skeptical of their results.

  1. I also ran up against computational constraints here. I'm using almost every CPU cycle (and most of the RAM) I can get my hands on to run some last-minute analysis for the aforementioned paper submission, so I didn't have a lot of resources left over to throw at this. To top it off there's a bunch of end-of-semester server maintenance going on which both took processors out of the rotation and prevented me from parallelizing this the way I wanted.
Posted in Business / Economics, CS / Science / Tech / Coding

"Traffic," Tom Vanderbilt

This is a good compendium. Nothing too ground-breaking here, but Vanderbilt does cover a lot of ground.

I especially liked that Vanderbilt addressed self-driving cars. Traffic was published in 2009; I didn't expect then that producers would have made as much progress towards autonomous vehicles as they have in the last four years. I am more optimistic about overcoming regulatory hurdles than I was then, but I still believe those will be bigger obstacles than any technological difficulties.

Traffic: Why We Drive the Way We Do (and What It Says About Us), Tom Vanderbilt

I find any serious discussion of congestion, mass transit, electric vehicles, hybrids, land use, urban planning, fuel usage, carbon emissions, etc. pretty pointless if it doesn't consider the transformative effects of autonomous vehicles. Planning a new highway or commuter rail line that's supposed to be useful for the next fifty years without considering robo-cars feels like some 1897 Jules Verne-esque proto-steampunk fantasy that predicts the next century will look just like the last one except it will have more telegraphs and longer steam trains. You might as well be sitting around in a top hat and frock coat micromanaging where you'll be putting all the stables and coal bunkers for the next five generations, oblivious to Messrs Benz, Daimler, Peugeot et al. motoring around on your lawn.

I think you can wrap most of the problems of traffic congestion up into several short, unimpeachable statements:

  1. Costs can take the form of both money and time.
  2. Lowering the cost of something means people will do more of it, ceteris paribus.
  3. Reducing traffic congestion reduces the time-cost of driving.
  4. The reduced cost of driving causes people to want to drive more, raising traffic congestion again.

Unless someone can show me one of those four statements is incorrect, I'm comfortable concluding that traffic is here to stay for the foreseeable future.

Plenty of people think they have the cure for congestion: roundabouts, light rail, "livable communities," bike sharing, HOV lanes, high-density residences, abolishing free parking, mileage fees, congestion fees, etc. Some of these are good ideas, and some aren't. But I'm not taking anyone who claims to solve (or even alleviate) the traffic problem seriously unless they can address how their solution interacts with #1-4 above.

For some of the proposals the resolution is simple: they lower the time-cost but explicitly raise the monetary cost (e.g. congestion pricing, market-based rates for parking). Others don't have such an easy time of it. But either way, I'd like people to at least be able to address how they would break out of this feedback loop.

PS I once sat through an hour-long keynote by an eminent professor from MIT Sloan on modeling market penetration of alternative fuel vehicles. Half of his talk ended up being about gas shortages, both in the 1970s and after Hurricane Sandy. At no point in those thirty minutes did he once mention the word "price"! Everything I had heard about the distinction between freshwater and saltwater economics snapped into focus.

Posted in Business / Economics, Reviews

Reading List for 28 May 2013

For Science!

This is me right now seemingly all the time.

Patrick Morrison & Emerson Murphy-Hill :: Is Programming Knowledge Related To Age? An Exploration of Stack Overflow [pdf]

As a CS guy who's tip-toed into psychology here and there I would offer Morrison & Murphy-Hill this advice: tread very, very lightly when making claims regarding the words "knowledge" and especially "intelligence."

Playing_forever :: Playing Forever

I'm glad I didn't know about this in the winter of 2003, when I engaged in intense bouts of Tetris as a weird form of post-modern zazen. I still remember the guy who used to sit in front of me in Linear Algebra wore a tattersall shirt every single class, and I would see tetrominos cascading down his back.

RWCG :: What Brown-Vitter are asking for

This is why I want legislators & regulators who have played some strategy games. I want people making rules who have the habit of thinking, "If I do this, what is the other guy going to do? Surely he won't simply keep doing the things he was doing before I changed the environment. And surely not the exact thing that I hope he does. What if he responds by...?"

Stephen Landsburg :: Seven Trees in One

We started with a weird pseudo-equation, manipulated it as if it were meaningful, transformed it into a series of statements that were either meaningless or clearly false, and out popped something that happened to be true. What Blass essentially proved (and Fiore and Leinster generalized) is, in effect, is that this is no coincidence. More specifically, they’ve proved in a very broad context that if you manipulate this kind of equation, pretending that sets are numbers and not letting yourself get ruffled by the illegitimacy of everything you’re doing, the end result is sure to be either a) obviously false or b) true.

Scott Weingart :: Friends don’t let friends calculate p-values (without fully understanding them)

My (very briefly stated) problem with p-values is that they combine size-of-effect and effort-in-experiment into one scalar. (This has been in the news a lot lately with the Oregon Medicaid study. Was the effect of Medicaid on blood pressure, glucose levels, etc. insignificant because Medicaid doesn't help much or because the sample size was too small? Unsurprisingly, people's answers to this question are perfectly correlated with all of their prior political beliefs.)

One of the pitfalls of computational modeling is that it allows researchers to just keep churning out simulation runs until their results are "significant." Processor cycles get shoveled into the model's maw until you have enough results to make even a tiny observed effect fit in under that magical p=0.05 limit. In theory everyone knows this isn't kosher, but "in theory" only takes us so far.
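A back-of-the-envelope illustration of how effect size and effort get mashed together, assuming a plain one-sample z-test with a known standard deviation (the 0.02 effect and unit standard deviation are invented numbers, not from any real model):

```python
import math


def two_sided_p(effect, sd, n):
    """Two-sided p-value of a one-sample z-test for an observed mean `effect`."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


# Same tiny effect every time; only the number of runs changes.
for n in (100, 1_000, 10_000, 100_000):
    print(n, two_sided_p(effect=0.02, sd=1.0, n=n))
```

With these numbers the effect is nowhere near significant at 100 runs and slips under 0.05 by 10,000 runs, even though nothing about the effect itself has changed.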

Colin Eatock :: A North American's Guide to the use and abuse of the modern PhD

Eatock specifically means the use and abuse of the letters "PhD" as a postnominal, and the appellation "Doctor," not uses/abuses of doctoral programs eo ipso.

I'm not big on titles ("The rank is but the guinea's stamp / The Man's the gowd for a' that"). Once I've defended I'll probably make one hotel reservation as "Dr. Sylvester" just so I've done it and gotten it out of my system.

I am irked by people claiming that a non-medical doctorate is somehow "not real" though. "Doctor," like most words, has several meanings. What kind of semiotic/linguistic authority are they to declare which one is "real" and which isn't? Thanks, but they can leave their self-serving grammatical prescriptivism out of this.

Scott Aaronson :: D-Wave: Truth finally starts to emerge

Suppose that... it eventually becomes clear that quantum annealing can be made to work on thousands of qubits, but that it’s a dead end as far as getting a quantum speedup is concerned. Suppose the evidence piles up that simulated annealing on a conventional computer will continue to beat quantum annealing, if even the slightest effort is put into optimizing the classical annealing code. If that happens, then I predict that the very same people now hyping D-Wave will turn around and—without the slightest acknowledgment of error on their part—declare that the entire field of quantum computing has now been unmasked as a mirage, a scam, and a chimera. The same pointy-haired bosses who now flock toward quantum computing, will flock away from it just as quickly and as uncomprehendingly. Academic QC programs will be decimated, despite the slow but genuine progress that they’d been making the entire time in a “parallel universe” from D-Wave.

I think Aaronson is right to worry about that possibility. That's essentially what caused the "AI Winter." I'd hate to see that happen to QC.

Posted in Reading Lists

"Ragnarok: The End of the Gods," A.S. Byatt

Ragnarok: The End of the Gods

This is part of the Canongate Myth Series, which has contemporary authors re-telling ancient myths.

I soaked up all the Greco-Roman mythology I could get as a kid. My parents cleverly gave me a gift-wrapped copy of D'Aulaires' Book of Greek Myths right before boarding a flight to Florida. That was an extremely effective way to keep an eight year old Jared quiet for three hours.

Despite an interest, I've never immersed myself in other cultures' myths to the same degree. Having them actually presented as fiction like Byatt does here worked better than attempting to read about them as non-fiction. Previous non-fiction sources I've tried are either superficial or fractally labyrinthine. I think the framing story Byatt chose was a little superfluous, though it does get points for lyricism.

Harriet Walter's narration in the audiobook version I listened to was quite good. There were several passages of extended lists of beasts and plants and such that worked much better narrated than they would have in print. What would have been skimmed over in print had a hypnotic quality when spoken. (See lyricism remark supra.)

As a whole it was certainly good enough for me to pick up other books in the series.

Posted in Reviews

Command line history

Jude Robinson :: The single most useful thing in bash

Create ~/.inputrc and fill it with this:

"\e[A": history-search-backward
"\e[B": history-search-forward
set show-all-if-ambiguous on
set completion-ignore-case on

This allows you to search through your history using the up and down arrows … i.e. type cd / and press the up arrow and you'll search through everything in your history that starts with cd /.

Wow. That is not an exaggeration at all: the most useful thing. I am so thrilled to finally be able to search my shell history the same way I can my Matlab history. I've been able to do this there for ages and my mind still hasn't caught up with not being able to do it in the shell.

If it's not clear to you why this is useful or why it pleases me, I don't think there's anything I can do to explain it. Sorry.

PS Anyone have first-hand experience with the fish shell? The autocompletions and inline, automatic syntax highlighting seem clever. I need to get around to giving it a try on one of my boxes.

Posted in CS / Science / Tech / Coding


The Economist :: Babbage Blog :: Humble Pi

The Raspberry Pi is the brainchild of a couple of computer scientists at Cambridge University. Back in 2006, they lamented the decline in programming skills among applicants for computer-science courses. ... Over the past ten years, computer-science students have gone from arriving at university with a knowledge of several assembly and high-level programming languages to having little more than a working knowledge of HTML, Javascript and perhaps PHP—simple tools for constructing web sites. To learn a computer language, “you’ve got to put in your 10,000 hours,” says Dr Upton. “And it’s a lot easier if you start when you’re 18.” Some would say it is even better to start at 14.

The problem is not a lack of interest, but the lack of cheap, programmable hardware for teenagers to cut their teeth on. For typical youngsters, computers have become too complicated, too difficult to open (laptops especially) and alter their settings, and way too expensive to tinker with and risk voiding their warranty by frying their innards.

I don't see the connection between learning to code and having super-cheap hardware. Back when I was a kid learning to program I actually had to pay real money for a compiler. (Anyone else remember Borland Turbo C++?) Now you're tripping over free languages and environments to use, including many that run entirely through your browser so there's zero risk to your machine.

Honestly, how many teens are going to go full-David Lightman and be doing serious enough hacking that their hardware is at risk? Is the goal to make sure teens have the opportunity to start learning to code before college, or to give them hardware to tinker with? Those are both fine goals. Being a software guy I'd put more weight on the former, but the important thing is that the ways to accomplish each are completely different.

The Pi is a great way to meet the goal of giving people cheap hardware to experiment with. But if the goal is to give kids an opportunity to start racking up their 10k hours in front of an interpreter or compiler then projects like are a lot better. ( has in-browser interpreters for JavaScript, Ruby, Python, Scheme and a dozen other languages.)

For starters, [your correspondent] plans to turn his existing Raspberry Pi into a media centre. By all accounts, Raspbmc—a Linux-based operating system derived from the XBox game-player’s media centre—is a pretty powerful media player. The first task, then, is to rig the Raspberry Pi up so it can pluck video off the internet, via a nearby WiFi router, and stream it direct to a TV in the living room. Finding out not whether, but just how well, this minimalist little computer manages such a feat will be all part of the fun.

I did this exact project about a month ago, and couldn't be more pleased with either the results or the fun of getting it to work. I still have to tinker with some things: the Vimeo plugin won't log into my account, and I need to build a case. Other than that, I wish I had done this a long time ago.

Posted in CS / Science / Tech / Coding

Reading List for 2 May 2013

Marginal Revolution :: Tyler Cowen :: Is there a shortage of STEM workers in the United States?

Simplified analogy: I'm not bidding up the price of quadcopters. That doesn't mean that if we had more of them I wouldn't find cool stuff to do with them.

(For other takes on this see Ian Hathaway and Alex Tabarrok.)

The paper Cowen is responding to states: "The annual number of computer science graduates doubled between 1998 and 2004, and is currently over 50 percent higher than its 1998 level." Another way to describe this situation is "The annual number of CS graduates has fallen by a quarter in less than a decade." That gives a rather different spin than the authors' formulation.

Uncrate :: Information Graphics by Sandra Rendgen

Recommended. Both useful and pretty. There aren't many books I've gotten from the UMD library for work that I'm happy to leave on the coffee table.

Christopher Rowe. "The new library of Babel? Borges, digitisation and the myth of the universal library." First Monday, 18(2). 2013.

As a general rule, I'm skeptical of papers that make heavy use of vocabulary like "problematise." But another general rule is that Borges' "Library of Babel" is amazing, so...

HBR Blogs :: Grant McCracken :: Is Timex Suffering the Early Stages of Disruption?

Bloomberg :: Virginia Postrel :: Dove’s Fake New 'Real Beauty' Ads

Dove did a great job of rhetoric but then they had to go and dishonestly cloak it in the banners of Science.

Physics Buzz :: Chris Gorski :: Physicist Proposes New Way To Think About Intelligence

Wissner-Gross calls the concept at the center of the research "causal entropic forces." These forces are the motivation for intelligent behavior. They encourage a system to preserve as many future histories as possible. For example, in the cart-and-rod exercise, Entropica controls the cart to keep the rod upright. Allowing the rod to fall would drastically reduce the number of remaining future histories, or, in other words, lower the entropy of the cart-and-rod system. Keeping the rod upright maximizes the entropy. It maintains all future histories that can begin from that state, including those that require the cart to let the rod fall.

I'm not sure I buy this, but I'll hold judgement until I read the Phys Rev Lett paper. I do know this though: there are already about 10,000 different definitions of "entropy" in different fields, and that causes no end of inter-disciplinary confusion. I'm not looking forward to having to keep track of another.

Growth Matters :: Clement Wan :: Experiments in Education

io9 :: A map of U.S. roads and nothing else


Check out the full res version.

Posted in Reading Lists

"The disposable academic"

The Economist :: The disposable academic

You know you are a graduate student, goes one quip, when your office is better decorated than your home and you have a favourite flavour of instant noodle.

Chad Hagen's "Nonsensical Infographic No. 1"
True. And true.

Although the first has more to do with my wife and me having diverging opinions about contemporary art. I think Jared Tarbell prints and John Maeda quotes are great things to put on the wall. My wife... feels otherwise.

As to the second, my preference from among the widely-distributed brands is Maruchan Roast Chicken, but most varieties are good with a little extra curry powder, some sriracha, a bit of cilantro or spring onion, and a squeeze of lime.

(Side note: If you want to branch out on your ramen choices, check out Ramenbox.)

Even graduates who find work outside universities may not fare all that well. PhD courses are so specialised that university careers offices struggle to assist graduates looking for jobs, and supervisors tend to have little interest in students who are leaving academia.

That part is true sans caveats. My advisor is supportive of me leaving academia, but neither he nor anyone else knows how to help me look for non-academic jobs. There's plenty of support if I wanted to stay in academia, and a fair amount if I wanted to be at a place like Sandia or MSR. But for the types of positions I want, I'm on my own.

Posted in Uncategorized