Tag Archives: projects

MalConv: Lessons learned from Deep Learning on executables

I don't usually write up my technical work here, mostly because I spend enough hours as is doing technical writing. But a co-author, Jon Barker, recently wrote a post on the NVIDIA Parallel For All blog about one of our papers on neural networks for detecting malware, so I thought I'd link to it here. (You can read the paper itself, "Malware Detection by Eating a Whole EXE" here.) Plus it was on the front page of Hacker News earlier this week, which is not something I thought would ever happen to my work.

Rather than rehashing everything in Jon's Parallel for All post about our work, I want to highlight some of the lessons about ML/neural nets/deep learning that we learned along the way.

By way of background, I'll lift a few paragraphs from Jon's introduction:

The paper introduces an artificial neural network trained to differentiate between benign and malicious Windows executable files with only the raw byte sequence of the executable as input. This approach has several practical advantages:

  • No hand-crafted features or knowledge of the compiler used are required. This means the trained model is generalizable and robust to natural variations in malware.
  • The computational complexity is linearly dependent on the sequence length (binary size), which means inference is fast and scalable to very large files.
  • Important sub-regions of the binary can be identified for forensic analysis.
  • This approach is also adaptable to new file formats, compilers and instruction set architectures—all we need is training data.

We also hope this paper demonstrates that malware detection from raw byte sequences has unique and challenging properties that make it a fruitful research area for the larger machine learning community.

One of the big issues we were confronting with our approach, MalConv, is that executables are often millions of bytes in length. That's orders of magnitude more time steps than most sequence processing networks deal with. Big data usually refers to lots and lots of small data points, but for us each individual sample was big. Saying this was a non-trivial problem is a serious understatement.

The MalConv architecture
Architecture of the malware detection network. (Image copyright NVIDIA.)
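If you prefer code to block diagrams, here is a rough PyTorch sketch of the architecture in the figure above. Treat the hyperparameters as approximate placeholders rather than the exact values we trained with; the paper has the real configuration.

```python
# Rough, illustrative PyTorch sketch of the MalConv-style architecture above.
# Hyperparameters are approximate placeholders, not the exact paper values.
import torch
import torch.nn as nn

class MalConvSketch(nn.Module):
    def __init__(self, vocab=257, embed_dim=8, channels=128, width=500):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim, padding_idx=0)
        # Wide-and-shallow: one very wide, strided convolution instead of a deep
        # stack, so inputs with millions of time steps fit in GPU memory.
        self.conv = nn.Conv1d(embed_dim, channels, kernel_size=width, stride=width)
        self.gate = nn.Conv1d(embed_dim, channels, kernel_size=width, stride=width)
        self.fc = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                nn.Linear(channels, 1))

    def forward(self, x):                      # x: (batch, num_bytes) of int64 byte values
        z = self.embed(x).transpose(1, 2)      # (batch, embed_dim, num_bytes)
        z = self.conv(z) * torch.sigmoid(self.gate(z))   # gated convolution
        z = torch.max(z, dim=2).values         # global (temporal) max pooling
        return self.fc(z)                      # single malware logit
```

The whole thing is only a handful of layers, which is exactly the wide-and-shallow point I get to below.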

Here are three lessons we learned, not about malware or cybersecurity, but about the process of building neural networks on such unusual data.

1. Deep learning != image processing

The large majority of the work in deep learning has been done in the image domain. Of the remainder, the large majority has been in either text or speech. Many of the lessons, best practices, rules of thumb, etc., that we think apply to deep learning may actually be specific to these domains.

For instance, the community has settled on narrow convolutional filters, stacked with a lot of depth, as generally the best way to go. And for images, narrow-and-deep absolutely seems to be the correct choice. But in order to get a network that processes two million time steps to fit in memory at all (on beefy 16GB cards, no less), we were forced to go wide-and-shallow.

With images, a pixel value is always a pixel value. 0x20 in a grayscale image is always darkish gray, no matter what. In an executable, byte values are ridiculously polysemous: 0x20 may be part of an instruction, a string, a bit array, a compressed or encrypted value, an address, etc. You can't interpolate between values at all, so you can't resize or crop the way you would with images to make your data set smaller or to do data augmentation. Binaries also play havoc with locality, since you can re-arrange functions in any order, among other things. You can't rely on any Tobler's Law ((Everything is related, but near things are more related than far things.)) relationship the way you can in images, text, or speech.

2. BatchNorm isn't pixie dust

Batch Normalization has this bippity-boppity-boo magic quality. Just sprinkle it on top of your network architecture, and things that didn't converge before now do, and things that did converge now converge faster. It's worked like that every time I've tried it — on images. When we tried it on binaries it actually had the opposite effect: networks that converged slowly now didn't at all, no matter what variety of architecture we tried. It's also had no effect at all on some other esoteric data sets that I've worked on.

We discuss this at more length in the paper (§5.3), but here's the relevant figure:

BatchNorm activations
KDE plots of the convolution response (pre-ReLU) for multiple architectures. Red and orange: two layers of ResNet; green: Inception-v4; blue: our network; black dashed: a true Gaussian distribution for reference.

This is showing the pre-BN activations from MalConv (blue) and from ResNet (red & orange) and Inception-v4 (green). The purpose of BatchNorm is to produce outputs that follow a standard normal distribution, and it implicitly expects inputs that are already reasonably close to one. What we suspect is happening is that the input values from other networks aren't Gaussian, but they're close-ish. ((I'd love to be able to quantify that closeness, but every test for normality I'm aware of doesn't apply when you have this many samples. If anyone knows of a more robust test please let me know.)) The input values for MalConv display huge asperity, and aren't even unimodal. If BatchNorm is being wonky for you, I'd suggest plotting the pre-BN activations and checking to see that they're relatively smooth and unimodal.
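Here's roughly how you can do that check yourself. This is an illustrative PyTorch/SciPy snippet, not our analysis code; `pre_bn_layer` is a placeholder for whatever module feeds your BatchNorm.

```python
# Illustrative check: KDE plot of the activations feeding a BatchNorm layer.
import numpy as np
import torch
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_pre_bn_activations(model, pre_bn_layer, batch):
    acts = []
    hook = pre_bn_layer.register_forward_hook(
        lambda module, inputs, output: acts.append(output.detach().flatten().cpu().numpy()))
    with torch.no_grad():
        model(batch)                            # one forward pass to capture activations
    hook.remove()
    x = np.concatenate(acts)
    grid = np.linspace(x.min(), x.max(), 500)
    plt.plot(grid, gaussian_kde(x)(grid))       # eyeball it: smooth and unimodal?
    plt.title("Pre-BatchNorm activation KDE")
    plt.show()
```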

3. The Lump of Regularization Fallacy

If you're overfitting, you probably need more regularization. Simple advice, and easily executed. Every time I see this brought up, though, people treat regularization as if it were a monolithic thing. Implicitly, people talk as if you have some pile of regularization, and if you need to fight overfitting then you just shovel more regularization on top. It doesn't matter what kind, just add more.

We ran into overfitting problems and tried every method we could think of: weight decay, dropout, regional dropout, gradient noise, activation noise, and on and on. The only one that had any impact was DeCov, which penalizes activations in the penultimate layer that are highly correlated with each other. I have no idea what will work on your data — especially if it's not images/speech/text — so try different types. Don't just treat regularization as a single knob that you crank up or down.
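For reference, the DeCov penalty looks roughly like this. This is my own PyTorch rendering of the idea from Cogswell et al., not our actual training code, and the 0.1 weight in the comment is just a made-up example value.

```python
# Illustrative DeCov-style penalty: discourage correlated features in the
# penultimate layer by penalizing the off-diagonal entries of their covariance.
import torch

def decov_loss(h):
    """h: (batch, features) activations of the penultimate layer."""
    h = h - h.mean(dim=0, keepdim=True)
    cov = h.t() @ h / h.shape[0]                   # feature covariance matrix
    off_diag = cov - torch.diag(torch.diag(cov))   # zero out the diagonal
    return 0.5 * (off_diag ** 2).sum()             # squared Frobenius norm of the rest

# Used as an extra term on top of the usual loss, e.g.:
#   loss = cross_entropy + 0.1 * decov_loss(penultimate_activations)
```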

I hope some of these lessons are helpful to you if you're into cybersecurity, or pushing machine learning into new domains in general. We'll be presenting the paper this is all based on at the Artificial Intelligence for Cyber Security (AICS) workshop at AAAI in February, so if you're at AAAI then stop by and talk.

Posted in CS / Science / Tech / Coding

Reading List for 16 July 2013

Evan Miller :: Winkel Tripel Warping Trouble or "How I Found a Bug in the Journal of Surveying Engineering"

All programming blogs need at least one post unofficially titled “Indisputable Proof That I Am Awesome.” These are usually my favorite kind of read, as the protagonist starts out with a head full of hubris, becomes mired in self-doubt, struggles on when others would have quit, and then ultimately triumphs over evil (that is to say, slow or buggy computer code), often at the expense of personal hygiene and/or sanity.

I'm a fan of the debugging narrative, and this is a fine example of the genre. I've been wrestling with code for mapping projections recently, so I feel Miller's pain specifically. In my opinion the Winkel Tripel is mathematically gross, but aesthetically unsurpassed. Hopefully I'll find some time in the next week or so to put up a post about my mapping project.

Irene Global Tweets Winkel Tripel
A screenshot of a project I've been working on to map geotagged tweets.

Kevin Grier :: Breaking down the higher ed wage premium

wage premium by major
Wage premium and popularity of majors

File under "all college degrees are not created equal" or perhaps "no, junior, you may not borrow enough to buy a decent house in order to get a BA in psych."

Aleatha Parker-Wood :: One Shot vs Iterated Games

Social cohesion can be thought of as a manifestation of how "iterated" people feel their interactions are, how likely they are to interact with the same people again and again and  have to deal with long term consequences of locally optimal choices, or whether they feel they can "opt out" of consequences of interacting with some set of people in a poor way.

Mike Munger :: Grade Inflation? Some data

Munger links to some very good analysis but it occurs to me that what is really needed is the variance of grades over time and not just the mean. (Obviously these two things are related since the distribution is bounded by [0, 4]. A mean which has gone from 2.25 to 3.44 will almost certainly result in less variance here.)

I don't much care where the distribution is centered. I care how wide the distribution is — that's what lets observers distinguish one student from another. Rankings need inequality. Without it they convey no information.

Marginal Revolution :: Alex Tabarrok :: The Battle over Junk DNA

I share Graur's and Tabarrok's wariness over "high impact false positives" in science. This is a big problem with no clear solutions.

The Graur et al. paper that Tabarrok discusses is entertaining in its incivility. Sometimes civility is not the correct response to falsehoods. It's refreshing to see scientists being so brutally honest with their opinions. Some might say they are too brutal, but at least they've got the honest part.

Peter McCaffrey :: 5 reasons price gouging should be legal: Especially during disasters

McCaffrey is completely right. But good luck to him reasoning people out of an opinion they were never reasoned into in the first place.

I do like the neologism "sustainable pricing" that he introduces. Bravo for that.

I would add a sixth reason to his list: accusations of "price gouging" are one rhetorical prong in an inescapable triple bind. A seller has three MECE choices: price goods higher than is common, the same as is common, or lower than is common. These choices will result in accusations of price gouging, collusion, and anti-competitive pricing, respectively. Since there is no way to win when dealing with people who level accusations of gouging, the only sensible thing to do is ignore them.

Shawn Regan :: Everyone calm down, there is no “bee-pocalypse”

Executive summary: apiarists have agency, and the world isn't static. If the death rate of colonies increases, they respond by creating more colonies. Crisis averted.

Eliezer Yudkowsky :: Betting Therapy

"Betting Therapy" should be a thing. You go to a betting therapist and describe your fears — everything you're afraid will happen if you do X — and then the therapist offers to bet money on whether it actually happens to you or not. After you lose enough money, you stop being afraid.

Sign me up.

Posted in Reading Lists

Kaggle Black Box

This is the second machine learning competition hosted at Kaggle that I've gotten serious about entering and sunk time into only to be derailed by a paper deadline. I'm pretty frustrated. Since I didn't get a chance to submit anything for the contest itself, I'm going to outline the approach I was trying here.

First a bit of background on this particular contest. The data is very high dimensional (1875 features) and multicategorical (9 classes). You get 1000 labeled training points, which isn't nearly enough to learn a good classifier on this data. In addition you get ~130000 unlabeled points. The goal is to leverage all the unlabeled data to be able to build a decent classifier out of the labeled data. To top it off you have no idea what the data represents, so it's impossible to use any domain knowledge.

I saw this contest a couple of weeks ago shortly after hearing a colleague's PhD proposal. His topic is building networks of Kohonen Self-Organizing Maps (SOMs) for time series data, so SOMs are where my mind went first. SOMs are a good fit for this task: they can learn on labeled or unlabeled data, and they're excellent at dimensionality reduction.

An SOM of macroeconomic features. From Sarlin, "Exploiting the self-organizing financial stability map," 2013.

My approach was to use the unlabeled training data to learn a SOM, since they lend themselves well to unsupervised learning. Then I passed the labeled data to the SOM. The maximally active node (i.e. the node whose weight vector best matches the input vector, aka the "best matching unit" or BMU) got tagged with the class of that training sample. Then I could repeat with the test data, and read out the class(es) tagged to the BMU for each data point.
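In code, the core of that idea is only a few lines. This is an illustrative sketch using the MiniSom library rather than my actual contest code, and the grid size and iteration count are arbitrary placeholders.

```python
# Illustrative sketch: train a SOM on unlabeled data, tag each BMU with the
# classes of the labeled samples that land on it, then predict by plurality.
from collections import Counter, defaultdict
from minisom import MiniSom

def fit_and_tag(unlabeled, labeled, labels, grid=(20, 20), epochs=150):
    som = MiniSom(grid[0], grid[1], unlabeled.shape[1], sigma=1.0, learning_rate=0.5)
    som.random_weights_init(unlabeled)
    som.train_random(unlabeled, num_iteration=epochs * len(unlabeled))  # unsupervised pass
    tags = defaultdict(Counter)
    for x, y in zip(labeled, labels):
        tags[som.winner(x)][y] += 1          # the BMU gets tagged with this sample's class
    return som, tags

def predict(som, tags, x):
    counts = tags.get(som.winner(x))
    return counts.most_common(1)[0][0] if counts else None   # plurality vote, or abstain
```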

So far that's simple enough, but there is far too much data to learn a SOM on efficiently, ((I also ran up against computational constraints here. I'm using almost every CPU cycle (and most of the RAM) I can get my hands on to run some last-minute analysis for the aforementioned paper submission, so I didn't have a lot of resources left over to throw at this. To top it off there's a bunch of end-of-semester server maintenance going on which both took processors out of the rotation and prevented me from parallelizing this the way I wanted.)) so I turned to an old friend: ensemble methods.

[1] SOM bagging. The most obvious approach in many ways. Train each network on only a random subset of the data. The problem here is that any reasonable fraction of the data is still too big to get into memory. (IIRC Breiman's original Bagging paper used full bootstraps, i.e. resamples the same size as the original set, and even tested using resamples larger than the original data. That's not an option for me.) I could only manage 4096 data points (a paltry 3% of the data set) in each sample without page faulting. (Keep in mind again that a big chunk of this machine's memory was being used on my actual work.)

[2] SOM random dendrites. Like random forests, use the whole data set but only select a subset of the features for each SOM to learn from. I could use 64 of the 1875 features at a time. This is also about 3%; the standard is IIRC more like 20%.

In order to add a bit more diversity to ensemble members I trained each for a random number of epochs between 100 and 200. There are a lot of other parameters that could have been adjusted to add diversity: smoothing, distance function and size of neighborhoods, size of network, network topology, ...
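Putting those pieces together, the ensemble construction looked roughly like the sketch below, which continues the hypothetical fit_and_tag helper above and folds bagging, random features, and random epoch counts into one loop. The member count and seed are arbitrary.

```python
# Illustrative ensemble construction: each SOM sees 4096 random rows, 64 random
# features, and trains for a random number of epochs between 100 and 200.
import numpy as np

rng = np.random.default_rng(0)

def build_ensemble(unlabeled, labeled, labels, n_members=25):
    ensemble = []
    for _ in range(n_members):
        rows = rng.choice(len(unlabeled), size=4096, replace=False)      # SOM bagging
        feats = rng.choice(unlabeled.shape[1], size=64, replace=False)   # random dendrites
        epochs = int(rng.integers(100, 201))                             # extra diversity
        som, tags = fit_and_tag(unlabeled[np.ix_(rows, feats)],
                                labeled[:, feats], labels, epochs=epochs)
        ensemble.append((som, tags, feats))    # remember which features this member saw
    return ensemble
```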

This is all pretty basic. The tricky part is combining the individual SOM predictions. For starters, how should you make a prediction with a single SOM? The BMU often had several different classes associated with it. You can pick whichever class has a plurality, and give that network's vote to that class. You can assign fractions of its vote in proportion to the class ratio of the BMU. You can take into account the distance between the sample and the BMU, and incorporate the BMU's neighbors. You can use a softmax or other probabilistic process. You can weight nodes individually or weight the votes of each SOM. This weighting can be done the traditional way (e.g. based on accuracy on a validation set) or in a way that is unique to the SOM's competitive learning process (e.g. how many times was this node the BMU? what is the distance in weight-space between this node and its neighbors? how much has this node moved in the final training epochs?).
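As one concrete example of those options, here's what the fractional-vote scheme might look like, continuing the hypothetical sketches above (each ensemble member is a (som, tags, feats) triple):

```python
# Illustrative combiner: each SOM casts fractional votes in proportion to the
# class mix of the test point's BMU, and the class with the most votes wins.
from collections import Counter

def combine_votes(ensemble, x):
    votes = Counter()
    for som, tags, feats in ensemble:
        counts = tags.get(som.winner(x[feats]))
        if not counts:
            continue                              # this member abstains on an untagged BMU
        total = sum(counts.values())
        for cls, n in counts.items():
            votes[cls] += n / total               # fractional vote by class ratio
    return votes.most_common(1)[0][0] if votes else None
```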

At some point I'm going to come back to this. I have no idea if Kaggle keeps the infrastructure set up to allow post-deadline submissions, but I hope they do. I'd like to get my score on this just to satisfy my own curiosity.


This blackbox prediction concept kept cropping up in my mind while reading Nate Silver's The Signal and the Noise. We've got all these Big Questions where we're theoretically using scientific methods to reach conclusions, and yet new evidence rarely seems to change anyone's mind.

Does Medicaid improve health outcomes? Does the minimum wage increase unemployment? Did the ARRA stimulus spending work? In theory the Baicker et al. Oregon study, Card & Krueger, and the OMB's modeling ought to cause people to update beliefs but they rarely do. Let's not even get started on the IPCC, Mann's hockey stick, etc.

So here's what I'd like to do for each of these supposedly-evidence-based-and-numerical-but-not-really issues. Assemble an expert group of econometricians, modelers, quants and so on. Give them a bunch of unlabeled data. They won't know what problem they're working on or what any of the features are. Ask them to come up with the best predictors they can.

If they determine minimum wages drive unemployment without knowing they're looking at economic data then that's good evidence the two are linked. If their solution uses Stanley Cup winners but not atmospheric CO2 levels to predict tornado frequency then that's good evidence CO2 isn't a driver of tornadoes.

I don't expect this to settle any of these questions once-and-for-all — I don't expect anything at all will do that. There are too many problems (who decides what goes in the data set or how it's cleaned or scaled or lagged?). But I think doing things double-blind like this would create a lot more confidence in econometric-style results. In a way it even lessens the data-trawling problem by stepping into the issue head-on: no more doubting how much the researchers just went fishing for any correlation they could find, because we know that's exactly what they did, so we can be fully skeptical of their results.

Posted in Business / Economics, CS / Science / Tech / Coding

Pi

The Economist :: Babbage Blog :: Humble Pi

The Raspberry Pi is the brainchild of a couple of computer scientists at Cambridge University. Back in 2006, they lamented the decline in programming skills among applicants for computer-science courses. ... Over the past ten years, computer-science students have gone from arriving at university with a knowledge of several assembly and high-level programming languages to having little more than a working knowledge of HTML, Javascript and perhaps PHP—simple tools for constructing web sites. To learn a computer language, “you’ve got to put in your 10,000 hours,” says Dr Upton. “And it’s a lot easier if you start when you’re 18.” Some would say it is even better to start at 14.

The problem is not a lack of interest, but the lack of cheap, programmable hardware for teenagers to cut their teeth on. For typical youngsters, computers have become too complicated, too difficult to open (laptops especially) and alter their settings, and way too expensive to tinker with and risk voiding their warranty by frying their innards.

I don't see the connection between learning to code and having super-cheap hardware. Back when I was a kid learning to program I actually had to pay real money for a compiler. (Anyone else remember Borland Turbo C++?) Now you're tripping over free languages and environments to use, including many that run entirely through your browser so there's zero risk to your machine.

Honestly, how many teens are going to go full David Lightman and be doing serious enough hacking that their hardware is at risk? Is the goal to make sure teens have the opportunity to start learning to code before college, or to give them hardware to tinker with? Those are both fine goals. Being a software guy I'd put more weight on the former, but the important thing is that the ways to accomplish them are completely different.

The Pi is a great way to meet the goal of giving people cheap hardware to experiment with. But if the goal is to give kids an opportunity to start racking up their 10k hours in front of an interpreter or compiler then projects like Repl.it are a lot better. (Repl.it has in-browser interpreters for JavaScript, Ruby, Python, Scheme and a dozen other languages.)

For starters, [your correspondent] plans to turn his existing Raspberry Pi into a media centre. By all accounts, Raspbmc—a Linux-based operating system derived from the XBox game-player’s media centre—is a pretty powerful media player. The first task, then, is to rig the Raspberry Pi up so it can pluck video off the internet, via a nearby WiFi router, and stream it direct to a TV in the living room. Finding out not whether, but just how well, this minimalist little computer manages such a feat will be all part of the fun.

I did this exact project about a month ago, and couldn't be more pleased with either the results or the fun of getting it to work. I still have to tinker with some things: the Vimeo plugin won't log into my account, and I need to build a case. Other than that, I wish I had done this a long time ago.

Posted in CS / Science / Tech / Coding

Computational Cartography (Part 2)

In the previous post I talked a bit about how one might build a system which would algorithmically divide the US into 50 states of approximately equal populations.

I've messed around with the problem a bit and developed a very rough approximation. Actually, I don't even feel comfortable calling this an approximation. But it is at least the scaffolding you could build an eventual solution around.

auto sized states - after

It imports the relevant data from the Census (after a lot of PitA data cleaning) and throws it up on the screen, connecting neighboring counties. Then it picks 50 random counties to be the centers of the states and runs a k-means update on them (weighted by the county populations) to iterate toward a better solution. The only wrinkle compared to a basic clustering is that it uses the population of a state to set a z-coordinate value for its centroid when determining the distance to each point. States with more people in them are represented higher up, making them effectively smaller.
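A rough sketch of that update step, based on the description above (not the actual project code; the z_scale knob and the flat 2-D coordinates are simplifying assumptions):

```python
# Illustrative population-weighted k-means step with the z-coordinate trick:
# crowded states sit higher above the plane, so they look farther from every
# county and stop absorbing new ones.
import numpy as np

def kmeans_step(county_xy, county_pop, centroids, assign, z_scale=1.0):
    """county_xy: (N, 2), county_pop: (N,), centroids: (K, 2), assign: (N,) state ids."""
    # Each state's z-coordinate grows with its current population.
    state_pop = np.bincount(assign, weights=county_pop, minlength=len(centroids))
    z = z_scale * state_pop / state_pop.mean()
    # Assign each county (which lives in the z = 0 plane) to the nearest centroid in 3-D.
    d2 = ((county_xy[:, None, :] - centroids[None, :, :]) ** 2).sum(-1) + z[None, :] ** 2
    assign = d2.argmin(axis=1)
    # Move each centroid to the population-weighted mean of its counties.
    for k in range(len(centroids)):
        mask = assign == k
        if mask.any():
            centroids[k] = np.average(county_xy[mask], axis=0, weights=county_pop[mask])
    return centroids, assign
```

Seed the centroids at 50 random counties and loop this until the assignments stop changing.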

The image above is what you end up with; here's where the system started:

auto sized states - before

The ratio of the most populous state to the least in America now is 66:1. As simple as this system is, some of the runs I tested on got that ratio down to 11:1.

There are a thousand things to do to this to make it an actual solution to the problem of automatically re-drawing state boundaries. I need to do things from the small (use spherical geometry to accurately calculate the distance between pairs of lat/long coordinates), to the medium (link AK/HI/PR to the mainland better), to the major (balance state populations more explicitly). Nevertheless, I think this is a useful framework to build some intelligent decision making on top of.

Posted in CS / Science / Tech / Coding

Computational Cartography (Part 1)

In the sixth grade my Social Studies teacher lent me a book which had a whole section devoted to alternative maps of the US. What would it look like with a dozen states? Or 200? What if state lines were drawn according to climate and vegetation patterns, or ancestral ethnicities, or any other speculative idea? I'm fascinated by maps generally, and these were no exception.

Here's an example of a speculative map of the US by Neil Freeman that's been making the rounds in the last few days:

The United States redrawn as Fifty States with Equal Population
"Electoral college reform," Neil Freeman

My first reaction to seeing this was: Cool. Now how do I automate this? Those two thoughts go together in my head a lot. Rather than just idly speculating about how to get a machine to do this, I'm actually going to give it a try. Or at least give a try at a first approximation.

Freeman's method was initialized algorithmically, but a lot of the decisions were subjective. He also uses a lot of data (drainage basins, commuting patterns) that would be too much of a pain for me to track down and integrate for this sort of spare-time project.

I'm starting by doing everything on a county-by-county basis, which is a useful limit to work with. Without limiting myself to some relatively coarse-grained pre-existing boundaries the computational possibilities would be staggeringly vast, and the result would probably look too foreign anyway. Limiting things to the 3,033 counties and county-equivalents (Louisianan parishes, Alaskan boroughs, Virginian independent cities, ...) will make things more manageable.

I'll need population data for each county, which I can get from the Census Bureau. I'll also need an adjacency graph of counties, which I thought would be tricky, but it turns out the Census Bureau has me covered there too. The location of each county is given in that first link. For the final rendering I'll need the shape of each county. Location is simple, but I'll have to do some digging for shape.

Methods... methods... hmmm.

A clustering approach seems like a natural way to start. Something like k-means where you start with some arbitrary, incorrect centers for each state, and move them around until each state is approximately equally sized. That would at least give you something to refine from.

K-means is related to Learning Vector Quantization, which was an ancestor of Self-Organizing Maps, so that's another direction to explore. I'm not sure off the top of my head how you'd enforce the population-similarity constraint. Or, for that matter, how you'd want to encode any of the information in feature space. (This 2011 paper by Skupin and Esperbé seems like a solid lead [pdf].) I'm not sure how an SOM would work for this but I have a strong feeling there's potential down that road.

There's definitely a greedy hill-climbing approach that could be easily implemented. Which means there's a Simulated Annealing approach as well. No way of knowing a priori how effective either would be.

Those techniques — and many others — will need a fitness function. It would be simple enough to write one based on the inequality in state populations. (Ratio of smallest to largest, Gini coefficient, ...) There should be a penalty for non-compact states, perhaps based on perimeter-to-area ratio. But that may cause a problem for states along coast lines or rivers. Perhaps use the adjacency graph and penalize states with large diameters, weighted by their actual distance in geographic space, relative to the number of counties. In any event, Hawaii and Alaska may need special attention for any distance-based rules.
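One possible fitness function along those lines, purely as a sketch: the county adjacency graph, the alpha weight, and the crude fragment penalty are all assumptions, and a real version would weight graph distances by geography.

```python
# Illustrative fitness function: population balance (max/min state population)
# plus a rough compactness penalty based on each state's county-graph diameter.
# Lower is better.
import networkx as nx
import numpy as np

def fitness(assign, county_pop, county_graph, n_states=50, alpha=1.0):
    """assign: array mapping county index -> state id in [0, n_states)."""
    state_pop = np.bincount(assign, weights=county_pop, minlength=n_states)
    balance = state_pop.max() / max(state_pop.min(), 1)          # population inequality
    sprawl = 0.0
    for k in range(n_states):
        counties = np.flatnonzero(assign == k)
        sub = county_graph.subgraph(counties)
        if len(sub) and nx.is_connected(sub):
            sprawl += nx.diameter(sub) / len(sub)                # stretched states cost more
        else:
            sprawl += 10.0                                       # heavy penalty for fragmented states
    return balance + alpha * sprawl
```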

Genetic Algorithms might come in handy, especially if you already had a reasonable starting point to refine from. How would a chromosome be represented, though? Each county as one gene, taking a value from 1 to 50? The state space would be huge. Something better than the naive approach would be needed.

I'm sure there are notes in either my Computational Geometry or GIS notebooks that would come in handy. I'll have to flip through those next time I'm in the lab. Maybe my Algorithmic Game Theory notes as well; I seem to remember some relevant problems on siting cell towers that may be adaptable.

This should give me plenty to get started with. I've got a lot of other irons in the fire, but hopefully in a few days I'll be back with another post laying out my preliminary efforts.

Posted in CS / Science / Tech / Coding