Category Archives: CS / Science / Tech / Coding

Kaggle Black Box

This is the second machine learning competition hosted at Kaggle that I've gotten serious about entering and sunk time into only to be derailed by a paper deadline. I'm pretty frustrated. Since I didn't get a chance to submit anything for the contest itself, I'm going to outline the approach I was trying here.

First a bit of background on this particular contest. The data is very high dimensional (1875 features) and multicategorical (9 classes). You get 1000 labeled training points, which isn't nearly enough to learn a good classifier on this data. In addition you get ~130000 unlabeled points. The goal is to leverage all the unlabeled data to be able to build a decent classifier out of the labeled data. To top it off you have no idea what the data represents, so it's impossible to use any domain knowledge.

I saw this contest a couple of weeks ago, shortly after hearing a colleague's PhD proposal. His topic is building networks of Kohonen Self-Organizing Maps for time series data, so SOMs are where my mind went first. SOMs are a good fit for this task: they can learn on labeled or unlabeled data, and they're excellent at dimensionality reduction.

An SOM of macroeconomic features. From Sarlin, "Exploiting the self-organizing financial stability map," 2013.

My approach was to use the unlabeled training data to learn a SOM, since they lend themselves well to unsupervised learning. Then I passed the labeled data to the SOM. The maximally active node (i.e. the node whose weight vector best matches the input vector, aka the "best matching unit" or BMU) got tagged with the class of that training sample. Then I could repeat with the test data, and read out the class(es) tagged to the BMU for each data point.
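Here's a minimal sketch of that pipeline, assuming NumPy and a toy from-scratch SOM rather than whatever library you'd actually use. The arrays unlabeled_subset, train_X, train_y, and test_X are placeholders, and the map size and learning parameters are just illustrative defaults.

import numpy as np
from collections import defaultdict

def train_som(data, rows=20, cols=20, epochs=100, lr0=0.5, sigma0=3.0, seed=0):
    """Plain online SOM training with a shrinking Gaussian neighborhood."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(rows, cols, data.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)              # learning rate decays toward zero
        sigma = sigma0 * (1 - epoch / epochs) + 0.5  # neighborhood shrinks over time
        for x in data[rng.permutation(len(data))]:
            b = np.unravel_index(np.argmin(((weights - x) ** 2).sum(-1)), (rows, cols))
            dist2 = ((grid - np.array(b)) ** 2).sum(-1)
            h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]  # Gaussian neighborhood kernel
            weights += lr * h * (x - weights)
    return weights

def bmu(weights, x):
    """Coordinates of the best matching unit for sample x."""
    return np.unravel_index(np.argmin(((weights - x) ** 2).sum(-1)), weights.shape[:2])

# 1) unsupervised: learn the map from (a subset of) the unlabeled points
som = train_som(unlabeled_subset)                  # unlabeled_subset: (n, 1875) placeholder array

# 2) tag each labeled point's BMU with that point's class
tags = defaultdict(list)
for x, y in zip(train_X, train_y):
    tags[bmu(som, x)].append(y)

# 3) predict: read out the class(es) tagged to each test point's BMU
predictions = []
for x in test_X:
    classes = tags.get(bmu(som, x), [])
    predictions.append(max(set(classes), key=classes.count) if classes else None)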

So far that's simple enough, but there is far too much data to learn a SOM on efficiently,¹ so I turned to my old standby: ensemble methods.

[1] SOM bagging. The most obvious approach in many ways. Train each network on only a random subset of the data. The problem here is that any reasonable fraction of the data is still too big to fit in memory. (IIRC Breiman's original Bagging paper used full bootstraps, i.e. resamples the same size as the original set, and even tested resamples larger than the original data. That's not an option for me.) I could only manage 4096 data points (a paltry 3% of the data set) in each sample without page faulting. (Keep in mind again that a big chunk of this machine's memory was being used on my actual work.)

[2] SOM random dendrites. Like random forests, use the whole data set but select only a subset of the features for each SOM to learn from. I could use 64 of the 1875 features at a time. That's also about 3%; the standard is IIRC more like 20%.

In order to add a bit more diversity to the ensemble members, I trained each for a random number of epochs between 100 and 200. There are a lot of other parameters that could have been adjusted to add diversity: smoothing, distance function and size of neighborhoods, size of network, network topology, ... (A rough sketch of how each ensemble member gets built follows below.)
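Here's a hedged sketch of how one ensemble member might be built under those constraints, reusing the toy train_som from the sketch above. The sample size, feature count, ensemble size, and the unlabeled_data array are all placeholders, not the values I actually ran.

import numpy as np

def make_member(unlabeled, rng, n_points=4096, n_features=64):
    row_idx = rng.choice(len(unlabeled), size=n_points, replace=False)         # small random data subsample
    feat_idx = rng.choice(unlabeled.shape[1], size=n_features, replace=False)  # random feature subset
    epochs = int(rng.integers(100, 201))                                       # random training length, 100-200
    weights = train_som(unlabeled[np.ix_(row_idx, feat_idx)], epochs=epochs)
    return {"weights": weights, "features": feat_idx}

rng = np.random.default_rng(0)
ensemble = [make_member(unlabeled_data, rng) for _ in range(25)]   # 25 members, purely illustrative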

This is all pretty basic. The tricky part is combining the individual SOM predictions. For starters, how should you make a prediction with a single SOM? The BMU often had several different classes associated with it. You can pick whichever class has a plurality, and give that network's vote to that class. You can assign fractions of its vote in proportion to the class ratio of the BMU. You can take into account the distance between the sample and the BMU, or incorporate the BMU's neighbors. You can use a softmax or other probabilistic process. You can weight nodes individually or weight the votes of each SOM. This weighting can be done the traditional way (e.g. based on accuracy on a validation set) or in a way that is unique to the SOM's competitive learning process (e.g. how many times was this node the BMU? what is the distance in weight-space between this node and its neighbors? how much has this node moved in the final training epochs?).
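For concreteness, here is one way the vote combining could look, covering just the first two options (winner-take-all plurality versus fractional votes by class ratio). It leans on the toy bmu helper and the member dictionaries from the sketches above; member_tags is assumed to be a list holding, for each member, a dict from BMU coordinates to the labels that landed there, built in that member's feature subspace.

from collections import Counter

def combine(x, ensemble, member_tags, split_votes=True):
    votes = Counter()
    for member, tags in zip(ensemble, member_tags):
        node = bmu(member["weights"], x[member["features"]])     # BMU in this member's feature subspace
        classes = tags.get(node, [])
        if not classes:
            continue                                             # untagged BMU: this member abstains
        if split_votes:
            for c, n in Counter(classes).items():
                votes[c] += n / len(classes)                     # fractional vote by class ratio at the BMU
        else:
            votes[Counter(classes).most_common(1)[0][0]] += 1.0  # winner-take-all plurality vote
    return votes.most_common(1)[0][0] if votes else None

Distance weighting, pooling over the BMU's neighbors, and per-node or per-network weights would all slot in at the abstention check or the vote increment.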

At some point I'm going to come back to this. I have no idea if Kaggle keeps the infrastructure set up to allow post-deadline submissions, but I hope they do. I'd like to get my score on this just to satisfy my own curiosity.


This blackbox prediction concept kept cropping up in my mind while reading Nate Silver's The Signal and the Noise. We've got all these Big Questions where we're theoretically using scientific methods to reach conclusions, and yet new evidence rarely seems to change anyone's mind.

Does Medicaid improve health outcomes? Does the minimum wage increase unemployment? Did the ARRA stimulus spending work? In theory the Baicker et al. Oregon study, Card & Krueger, and the OMB's modeling ought to cause people to update beliefs but they rarely do. Let's not even get started on the IPCC, Mann's hockey stick, etc.

So here's what I'd like to do for each of these supposedly-evidence-based-and-numerical-but-not-really issues. Assemble an expert group of econometricians, modelers, quants and so on. Give them a bunch of unlabeled data. They won't know what problem they're working on or what any of the features are. Ask them to come up with the best predictors they can.

If they determine minimum wages drive unemployment without knowing they're looking at economic data then that's good evidence the two are linked. If their solution uses Stanley Cup winners but not atmospheric CO2 levels to predict tornado frequency then that's good evidence CO2 isn't a driver of tornadoes.

I don't expect this to settle any of these questions once and for all — I don't expect anything at all will do that. There are too many problems (who decides what goes in the data set, or how it's cleaned or scaled or lagged?). But I think doing things double-blind like this would create a lot more confidence in econometric-style results. In a way it even defuses the data-trawling objection by meeting it head-on: there's no more wondering how much the researchers went fishing for any correlation they could find, because we know that's exactly what they did, and we can calibrate our skepticism accordingly.


  1. I also ran up against computational constraints here. I'm using almost every CPU cycle (and most of the RAM) I can get my hands on to run some last-minute analysis for the aforementioned paper submission, so I didn't have a lot of resources left over to throw at this. To top it off there's a bunch of end-of-semester server maintenance going on which both took processors out of the rotation and prevented me from parallelizing this the way I wanted.

Command line history

Jude Robinson :: The single most useful thing in bash

Create ~/.inputrc and fill it with this:

"\e[A": history-search-backward
"\e[B": history-search-forward
set show-all-if-ambiguous on
set completion-ignore-case on

This allows you to search through your history using the up and down arrows … i.e. type cd / and press the up arrow and you'll search through everything in your history that starts with cd /.

Wow. That is not an exaggeration at all: the most useful thing. I am so thrilled to finally be able to search my shell history the same way I can my Matlab history. I've been able to do this there for ages and my mind still hasn't caught up with not being able to do it in the shell.

If it's not clear to you why this is useful or why it pleases me, I don't think there's anything I can do to explain it. Sorry.


PS Anyone have first-hand experience with the fish shell? The autocompletions and inline, automatic syntax highlighting seem clever. I need to get around to giving it a try on one of my boxes.


Pi

The Economist :: Babbage Blog :: Humble Pi

The Raspberry Pi is the brainchild of a couple of computer scientists at Cambridge University. Back in 2006, they lamented the decline in programming skills among applicants for computer-science courses. ... Over the past ten years, computer-science students have gone from arriving at university with a knowledge of several assembly and high-level programming languages to having little more than a working knowledge of HTML, Javascript and perhaps PHP—simple tools for constructing web sites. To learn a computer language, “you’ve got to put in your 10,000 hours,” says Dr Upton. “And it’s a lot easier if you start when you’re 18.” Some would say it is even better to start at 14.

The problem is not a lack of interest, but the lack of cheap, programmable hardware for teenagers to cut their teeth on. For typical youngsters, computers have become too complicated, too difficult to open (laptops especially) and alter their settings, and way too expensive to tinker with and risk voiding their warranty by frying their innards.

I don't see the connection between learning to code and having super-cheap hardware. Back when I was a kid learning to program I actually had to pay real money for a compiler. (Anyone else remember Borland Turbo C++?) Now you're tripping over free languages and environments to use, including many that run entirely through your browser so there's zero risk to your machine.

Honestly, how many teens are going to go full David Lightman and be doing serious enough hacking that their hardware is at risk? Is the goal to make sure teens have the opportunity to start learning to code before college, or to give them hardware to tinker with? Those are both fine goals. Being a software guy I'd put more weight on the former, but the important thing is that the ways to accomplish them are completely different.

The Pi is a great way to meet the goal of giving people cheap hardware to experiment with. But if the goal is to give kids an opportunity to start racking up their 10k hours in front of an interpreter or compiler, then projects like Repl.it are a lot better. (Repl.it has in-browser interpreters for JavaScript, Ruby, Python, Scheme and a dozen other languages.)

For starters, [your correspondent] plans to turn his existing Raspberry Pi into a media centre. By all accounts, Raspbmc—a Linux-based operating system derived from the XBox game-player’s media centre—is a pretty powerful media player. The first task, then, is to rig the Raspberry Pi up so it can pluck video off the internet, via a nearby WiFi router, and stream it direct to a TV in the living room. Finding out not whether, but just how well, this minimalist little computer manages such a feat will be all part of the fun.

I did this exact project about a month ago, and couldn't be more pleased with either the results or the fun of getting it to work. I still have to tinker with some things: the Vimeo plugin won't log into my account, and I need to build a case. Other than that, I wish I had done this a long time ago.


Ruby

Coding Horror :: Jeff Atwood :: Why Ruby?

I've always been a little intrigued by Ruby, mostly because of the absolutely gushing praise Steve Yegge had for the language way back in 2006. I've never forgotten this.

For the most part, Ruby took Perl's string processing and Unix integration as-is, meaning the syntax is identical, and so right there, before anything else happens, you already have the Best of Perl. And that's a great start, especially if you don't take the Rest of Perl. But then Matz took the best of list processing from Lisp, and the best of OO from Smalltalk and other languages, and the best of iterators from CLU, and pretty much the best of everything from everyone.

Amen. My Ruby still isn't as idiomatic as I would like it to be, but even at my intermediate level I'm so pleased with how it gets out of the way and lets me do what I want, no matter what paradigm I'm thinking in at the time.

Atwood has this to say himself, which also gets my thumbs-up:

Ruby isn't cool any more. Yeah, you heard me. It's not cool to write Ruby code any more. All the cool people moved on to slinging Scala and Node.js years ago. Our project isn't cool, it's just a bunch of boring old Ruby code. Personally, I'm thrilled that Ruby is now mature enough that the community no longer needs to bother with the pretense of being the coolest kid on the block. That means the rest of us who just like to Get Shit Done can roll up our sleeves and focus on the mission of building stuff with our peers rather than frantically running around trying to suss out the next shiny thing.

I suppose it's in the nature of craftsmen to get a bit giddy about their gear, but at a certain point enough is enough. (This led to some frustration when I was learning photography a couple of years ago; photography amateurs seem especially prone to gear obsession.) Yeah, definitely choose the right tool for the job, but don't forget the point is the job, not the tool.

I've endured far too many conversations in the break room, and at barbecues, and at happy hour about the supremacy of various languages.

I don't care which languages people are going to be drooling over at SIGPLAN this year.

I don't care what static analysis theorems you can prove using which language.

I don't care how cool your tools are.

I care how cool your results are.


"Social networks in fact and fiction"

The Endeavour :: John Cook :: Social networks in fact and fiction

In a nutshell, the authors hope to get some insight into whether a myth is based on fact by seeing whether the social network of characters in the myth looks more like a real social network or like the social network in a work of deliberate fiction. For instance, the social networks of the Iliad and Beowulf look more like actual social networks than does the social network of Harry Potter. Real social networks follow a power law distribution more closely than do social networks in works of fiction.

If you read one version of Beowulf and it's not Seamus Heaney's translation you better have a very good reason.
Cool.

I vaguely remember reading about some astronomers using Homer's descriptions of constellations to pin down dates for various events in the Odyssey. Perhaps it was this PNAS paper by Baikouzis & Magnasco?

It seems however that an accurate historical account might have a suspicious social network, not because the events in it were made up but because they were filtered according to what the historian thought was important.

Indeed. Although I suspect this form of Narrative Bias would be less of a problem with Beowulf, the Iliad, etc., because the composers of those tales, and their audiences, had less exposure to the way fiction is "supposed" to be.

I would like to see someone do similar analysis for contemporary non-fiction. I prefer authors who tell a sprawling, tangled, less narratively driven story (Keay, Mann, Lewis, and Mukherjee come to mind) to those who fit everything into a more conventional structure. It's difficult for me to accept a story that takes place in a high-causal-density environment and yet somehow only a couple of people have any agency.

Nate Silver apparently feels the same way:

When we construct these stories, we can lose the ability to think about the evidence critically. Elections typically present compelling narratives. Whatever you thought about the politics of Barack Obama or Sarah Palin or John McCain or Hillary Clinton in 2008, they had persuasive life stories: reported books on the campaign, like Game Change, read like tightly plotted bestselling novels.
— The Signal and the Noise, p. 59

Those are not the books I'm interested in.


NLP is very much not my area, but I can't help but wonder about automatically generating some sort of metric about this for a book. Count up how many proper names appear, and build a graph of them based on how closely they are linked in the text. (That is, two names appearing often in the same sentence are more closely linked than two which appear in consecutive paragraphs.) Perhaps simply looking at the histogram of name occurrence frequency might give you some preliminary ability to separate books into "realistic social structures" and "fiction-like social structures."
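As a very rough sketch of what I mean, here's a naive version that treats repeated capitalized tokens as proper names and links names that co-occur in a sentence. Real work would use actual named-entity recognition; the regexes and the min_count threshold are arbitrary placeholders.

import re
from collections import Counter
from itertools import combinations

def name_network(text, min_count=5):
    sentences = re.split(r"[.!?]+", text)
    tokens = [re.findall(r"\b[A-Z][a-z]+\b", s) for s in sentences]
    freq = Counter(t for sent in tokens for t in sent)
    names = {t for t, n in freq.items() if n >= min_count}      # crude stand-in for a proper-name list
    edges = Counter()
    for sent in tokens:
        present = sorted({t for t in sent if t in names})
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1                                  # same-sentence co-occurrence strength
    return freq, edges

The histogram of freq values is the name-occurrence distribution, and edges is the weighted social graph you'd compare against real networks.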


Conway's Game Of Life in APL

APL is the sort of thing I imagine "Programmer-at-Arms" Pham Nuwen used in Vernor Vinge's "Zones of Thought" books. It seems both incomprehensibly magical and impeccably rational, forming the shortest bridge between what the programmer is thinking and what the computer is calculating.

Of course it also seems like you'd be building a bridge no one else could walk over. Not even you-one-year-from-now. I have a hard enough time figuring out what I was thinking when I read Ruby I wrote a year ago.
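For the record (and for future me), here's roughly the same computation spelled out longhand in NumPy: one Game of Life generation on a toroidal grid, using the same shift-the-whole-grid-and-sum trick the famous APL one-liner relies on. This is a sketch of the algorithm, not a translation of anyone's actual APL.

import numpy as np

def life_step(grid):
    """One generation on a toroidal grid of 0s and 1s (edges wrap, Asteroids-style)."""
    neighbors = sum(np.roll(np.roll(grid, dr, axis=0), dc, axis=1)
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
    # a cell lives with exactly 3 neighbors, or survives with exactly 2
    return ((neighbors == 3) | ((neighbors == 2) & (grid == 1))).astype(int)

glider = np.zeros((10, 10), dtype=int)
glider[1, 2] = glider[2, 3] = glider[3, 1] = glider[3, 2] = glider[3, 3] = 1
grid = glider
for _ in range(4):
    grid = life_step(grid)   # after four steps the glider has shifted one cell diagonally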


For more, here's a written tutorial by the same fellow. It had this helpful postscript regarding "glider guns":

Artillery in a finite manifold is a dangerous business.

Hehe. Good advice, I suppose.

Aficionados of the 1979 arcade game "Asteroids", which took place on a torus, will recall that the manufacturers thoughtfully gave their cannon a range just short of the screen's circumference.

Found that out the hard way back in undergrad when we wrote a genetic algorithm which learned to play Asteroids.


Manfred Mohr

I saw this post on Manfred Mohr's (very!) early digital art literally minutes after picking up a book about Mohr's work from the library yesterday. Total coincidence.

p231, Manfred Mohr

I like a lot of the pioneering digital artists' work. I think the constraints gave them a lot of focus. (Limits boost creativity.) You don't see many of those first-generation guys still producing like Mohr does, though. It's also refreshingly unusual that Mohr hasn't gotten bogged down in self-consciously retro, 8-bit, glitch kitsch.

(Dear [far too many digital artists]: I also remember that sprite animations were once a cutting-edge thing. Move on.)

Back when I was taking Computational Geometry I remember getting sidelined working out an algorithm based on the six dimensional rotations in Mohr's space.color.motion. At some point I need to dig my notes out for that and implement it.
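To be clear about what I mean (and emphatically not what Mohr's actual algorithm does), the toy version is something like: build the edge set of a 6-D hypercube, rotate it in a single coordinate plane, and project the result down to 2-D for drawing. Every number here is an arbitrary choice for illustration.

import numpy as np
from itertools import product

dims = 6
vertices = np.array(list(product([-1.0, 1.0], repeat=dims)))     # the 64 corners of a 6-D cube
edges = [(i, j) for i in range(len(vertices)) for j in range(i + 1, len(vertices))
         if np.sum(vertices[i] != vertices[j]) == 1]             # corners differing along exactly one axis

def plane_rotation(dims, a, b, theta):
    """Rotation by theta in the plane spanned by axes a and b, identity elsewhere."""
    r = np.eye(dims)
    r[a, a] = r[b, b] = np.cos(theta)
    r[a, b], r[b, a] = -np.sin(theta), np.sin(theta)
    return r

rotated = vertices @ plane_rotation(dims, 0, 4, np.pi / 5).T
flat = rotated[:, :2]                                            # naive orthographic projection to 2-D
segments = [(flat[i], flat[j]) for i, j in edges]                # line segments you would actually draw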

If you like Mohr you may want to check out this interview as well. I like what he says at the beginning about taking a simple idea and running with it. Also the questions about an artist working like a scientist were good. This also stood out:

So I went higher and higher in dimension and it got more and more complicated. I could do different things, but I could not explain to people because it was so complicated. '6-D cube, what kind of crap are you talking about?'

I've had that experience. Here's a free tip for you: do not try to explain something to a room full of psychologists by beginning "So if you were to imagine yourself moving around on the surface of a 35-dimensional hypercube..."

This interview also displays the link between algorithmic art and political economy. Mohr sets up a system, and lets it do its thing. The artwork emerges from the software. Legislators and regulators and executives set up a system and it does its thing. The economy emerges from society. Mohr's work, like all good algorithmic art, reminds me of Hayek's description of spontaneous order: the result "of human action but not of human design."


PS Prosthetic Knowledge points out that Mohr also maintains a very thorough YouTube channel. Would that all digital artists did similar.


Reader

Google is shutting down the Google Reader service, as you may have heard.

I am not happy about this, but I'm also not throwing fits about it. I spent more time in my Reader tab than in any other by a huge margin. I'm sure there's some other service out there that will fill the void. Google is also being pretty considerate about this: they're giving users several months' warning and they've made tools available to export your subscription data.

RWCG has had a couple of posts taking the wind out of the sails of all the people who are tearing out their hair and rending their garments and casting the evil eye towards Mountain View. Overall I think he's on the right track, but this post doesn't quite line up for me:

RWCG | Crimson Reach | Google Reader: The end of the ‘free’ era

How dare Google shut down a service I made heavy use of for years and years and paid nothing whatsoever for! [...] I wonder if this event marks the end of the ‘free’ era. You know the free era, it’s the era where this was the prevailing business philosophy:

1. Drive competing services out of business with a free service (subsidized by a profitable product).
2. Cancel free service.
3. ???

Actually no, that’s still snarky. It’s more like this:

1. Company gives away something for free.
2. People like it and use it.
3. This makes people think the company is cool. Their friend.
4. When it goes public, they buy its stock, cuz it’s so cool, and everyone they know uses and likes it. Surely that’s gotta mean something, stock-wise.
5. People continue to use the free thing and come to not only rely on it but expect it as their birthright.
6. ...
7. Profit?

According to this philosophy, giving away cool stuff for free was the wave of the future. It’s what all smart, cool companies did. Only backward knuckle-dragging idiots couldn’t figure out how this added up to a business model. Economic realities were no longer reality.

I think he's short-changing the business potential of a product ("product"?) like Google Reader. There's no direct line between "make Reader & give it away free" and "profit," but this approach still has some uses.

1. Yes, it makes people think you're cool and friendly and not-evil. Many firms do things for that reason alone. They spend billions on things way outside their core competencies just so people think they're cool and friendly. Isn't that the entire point of the "Corporate Social Responsibility" fad?

Google paying its employees to create Reader is in its wheelhouse; it makes sense. Far more sense than, for example, Chrysler paying its employees to lay bathroom tile in poor neighborhoods.

2. Providing services like Reader makes Google look cool to people generally, but more importantly it makes them look cool to geeks. It's a punchline that a firm's number one asset is its people, but that's pretty much true of Google. They can do what they do because they get the pick of the litter of hackers.

I went to a career fair my CS department sponsored a month or so ago. The line for the Google table was literally out the door. Most people in my graduate program are angling for academic jobs. Google is one of maybe four private companies that people will be impressed you're interviewing with.

3. Projects like Reader not only motivate applicants, they motivate employees.

Talent and productivity are extremely unevenly distributed among coders. The best are many orders of magnitude better than the median; the bottom decile (conservatively) have negative productivity. You usually don't get the best by offering them orders of magnitude more money; you get them by giving them cool problems to work on.

If you're excited about spending some time developing X, there's a good chance Google will let you do that. (At least in comparison to if you were working at Initech.) What's more, there's a chance Google will roll out X to millions of people, like they did with Reader. I can't stress enough how big of a motivator that can be.

4. Google's strategy for a while has been that anything they do to make people want to use the internet more is good for them, because they capture a dominant slice of the ad revenue online. More people spending more time online is better for Google, period. Reader fits into that. That's not a strategy that will work forever, or for many (any?) other companies. It can also be used to justify a lot of wasted "investments." But it's also true.

5. Google lives and dies off of data. Reader could have been generating that for them. I have no idea how much they actually learned from people's reading habits, if anything, but it had the potential to be a goldmine.

If you can predict people's age, race, religion, political party and drug use only using publicly available "likes" on Facebook, what could you do with my online reading habits? (Answer: sooooo much.)


I have no idea if it was a good idea or a bad one for Google to shut off Reader. I'm skeptical, but I realize I have none of the facts. I can't imagine it would cost that much to keep it running as is, especially compared to their other projects. I'm not sure what better use they have for the resources they're redeploying. I'm curious that they didn't even try to make it ad supported. Hell, I would have even paid directly to keep using it, and I pay for approximately zero online subscriptions.

Again, I don't know what they know. But I do know that "there's no revenue from this free product; let's shut it down" should not be the beginning and end of this decision making.
