{"id":1144,"date":"2017-12-08T10:23:15","date_gmt":"2017-12-08T15:23:15","guid":{"rendered":"http:\/\/www.jsylvest.com\/blog\/?p=1144"},"modified":"2017-12-08T10:23:15","modified_gmt":"2017-12-08T15:23:15","slug":"malconv","status":"publish","type":"post","link":"https:\/\/www.jsylvest.com\/blog\/2017\/12\/malconv\/","title":{"rendered":"MalConv: Lessons learned from Deep Learning on executables"},"content":{"rendered":"<p>I don't usually write up my technical work here, mostly because I spend enough hours as it is doing technical writing. But a co-author, Jon Barker, recently wrote\u00a0<a href=\"https:\/\/devblogs.nvidia.com\/parallelforall\/malware-detection-neural-networks\/\">a post on the NVIDIA Parallel For All blog<\/a> about one of our papers on neural networks for detecting malware, so I thought I'd link to it here. (You can read the paper itself, <a href=\"https:\/\/arxiv.org\/abs\/1710.09435\">\"Malware Detection by Eating a Whole EXE\" here<\/a>.) Plus it was on the front page of Hacker News earlier this week, which is not something I thought would ever happen to my work.<\/p>\n<p>Rather than rehashing everything in Jon's Parallel for All post about our work, I want to highlight some of the lessons about ML\/neural nets\/deep learning that we learned along the way.<\/p>\n<p>By way of background, I'll lift a few paragraphs from Jon's introduction:<\/p>\n<blockquote><p>The paper introduces an artificial neural network trained to differentiate between benign and malicious Windows executable files with only the raw byte sequence of the executable as input. This approach has several practical advantages:<\/p>\n<ul>\n<li>No hand-crafted features or knowledge of the compiler used are required. 
This means the trained model is generalizable and robust to natural variations in malware.<\/li>\n<li>The computational complexity is linearly dependent on the sequence length (binary size), which means inference is fast and scalable to very large files.<\/li>\n<li>Important sub-regions of the binary can be identified for forensic analysis.<\/li>\n<li>This approach is also adaptable to new file formats, compilers and instruction set architectures\u2014all we need is training data.<\/li>\n<\/ul>\n<p>We also hope this paper demonstrates that malware detection from raw byte sequences has unique and challenging properties that make it a fruitful research area for the larger machine learning community.<\/p><\/blockquote>\n<p>One of the big issues we were confronting with our approach, MalConv, is that executables are often millions of bytes in length. That's orders of magnitude more time steps than most sequence processing networks deal with. Big data usually refers to lots and lots of small data points, but for us each individual sample was big. 
Saying this was a non-trivial problem is a serious understatement.<\/p>\n<figure id=\"attachment_1149\" aria-labelledby=\"figcaption_attachment_1149\" class=\"wp-caption aligncenter\" style=\"width: 565px\"><a href=\"https:\/\/i0.wp.com\/www.jsylvest.com\/blog\/wp-content\/uploads\/2017\/12\/malconv.png\" rel=\"attachment wp-att-1149\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"wp-image-1149 size-full\" src=\"https:\/\/i0.wp.com\/www.jsylvest.com\/blog\/wp-content\/uploads\/2017\/12\/malconv.png?resize=555%2C682\" alt=\"The MalConv architecture\" width=\"555\" height=\"682\" srcset=\"https:\/\/i0.wp.com\/www.jsylvest.com\/blog\/wp-content\/uploads\/2017\/12\/malconv.png?w=555&amp;ssl=1 555w, https:\/\/i0.wp.com\/www.jsylvest.com\/blog\/wp-content\/uploads\/2017\/12\/malconv.png?resize=244%2C300&amp;ssl=1 244w\" sizes=\"auto, (max-width: 555px) 100vw, 555px\" \/><\/a><figcaption id=\"figcaption_attachment_1149\" class=\"wp-caption-text\">Architecture of the malware detection network. (Image copyright NVIDIA.)<\/figcaption><\/figure>\n<p>Here are three lessons we learned, not about malware or cybersecurity, but about the process of building neural networks on such unusual data.<\/p>\n<h3>1. Deep learning != image processing<\/h3>\n<p>The large majority of the work in deep learning has been done in the image domain. Of the remainder, the large majority has been in either text or speech. Many of the lessons, best practices, rules of thumb, etc., that we think apply to deep learning may actually be specific to these domains.<\/p>\n<p>For instance, the community has settled around narrow convolutional filters, stacked with a lot of depth as being generally the best way to go. And for images, narrow-and-deep absolutely seems to be the correct choice. 
But in order to get a network that processes two million time steps to fit in memory <em>at all<\/em> (on beefy 16GB cards no less) we were forced to go wide-and-shallow.<\/p>\n<p>With images, a pixel value is always a pixel value. <code>0x20<\/code> in a grayscale image is always darkish gray, no matter what. In an executable, a byte value is ridiculously polysemous: <code>0x20<\/code> may be part of an instruction, a string, a bit array, a compressed or encrypted value, an address, etc. You can't interpolate between values at all, so you can't resize or crop the way you would with images to make your data set smaller or introduce data augmentation. Binaries also play havoc with locality, since you can re-arrange functions in any order, among other things. You can't rely on any Tobler's Law ((Everything is related, but near things are more related than far things.)) relationship the way you can in images, text, or speech.<\/p>\n<h3>2. BatchNorm isn't pixie dust<\/h3>\n<p>Batch Normalization has this bippity-boppity-boo magic quality. Just sprinkle it on top of your network architecture, and things that didn't converge before now do, and things that did converge now converge faster. It's worked like that every time I've tried it \u2014 <em>on images<\/em>. When we tried it on binaries it actually had the opposite effect: networks that converged slowly now didn't at all, no matter what variety of architecture we tried.
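When debugging this, the most useful first step was simply looking at the distribution of the pre-BatchNorm activations. Here's a minimal numpy sketch of that kind of sanity check; the function name, thresholds, and mode-counting heuristic are mine for illustration, not code from the paper:

```python
import numpy as np

def activation_summary(acts, bins=50):
    """Summarize a 1-D array of pre-BatchNorm activations.

    Returns (skewness, excess kurtosis, approximate mode count):
    enough to spot the rough, multi-modal distributions that can
    make BatchNorm misbehave. A well-behaved input should be close
    to skew 0, excess kurtosis 0, with a single mode.
    """
    acts = np.asarray(acts, dtype=np.float64).ravel()
    z = (acts - acts.mean()) / acts.std()
    skew = np.mean(z ** 3)
    ex_kurt = np.mean(z ** 4) - 3.0

    hist, _ = np.histogram(acts, bins=bins)
    # Smooth the histogram a little so sampling noise doesn't create fake modes.
    smooth = np.convolve(hist, np.ones(5) / 5.0, mode="same")
    # Count interior local maxima that are reasonably tall.
    thresh = 0.1 * smooth.max()
    modes = 0
    for i in range(1, len(smooth) - 1):
        if smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1] and smooth[i] > thresh:
            modes += 1
    return skew, ex_kurt, modes
```

Run this on activations dumped from the layer just below each BatchNorm; a large excess kurtosis or a mode count above one is a red flag that the normalization is being fed something far from Gaussian.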
It's also had no effect at all on some other esoteric data sets that I've worked on.<\/p>\n<p>We discuss this at more length in the paper (\u00a75.3), but here's the relevant figure:<\/p>\n<figure id=\"attachment_1152\" aria-labelledby=\"figcaption_attachment_1152\" class=\"wp-caption aligncenter\" style=\"width: 970px\"><a href=\"https:\/\/i0.wp.com\/www.jsylvest.com\/blog\/wp-content\/uploads\/2017\/12\/bn-activations.png\" rel=\"attachment wp-att-1152\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-1152\" src=\"https:\/\/i0.wp.com\/www.jsylvest.com\/blog\/wp-content\/uploads\/2017\/12\/bn-activations.png?resize=960%2C637\" alt=\"BatchNorm activations\" width=\"960\" height=\"637\" srcset=\"https:\/\/i0.wp.com\/www.jsylvest.com\/blog\/wp-content\/uploads\/2017\/12\/bn-activations.png?resize=1024%2C679&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.jsylvest.com\/blog\/wp-content\/uploads\/2017\/12\/bn-activations.png?resize=300%2C199&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.jsylvest.com\/blog\/wp-content\/uploads\/2017\/12\/bn-activations.png?resize=768%2C510&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.jsylvest.com\/blog\/wp-content\/uploads\/2017\/12\/bn-activations.png?w=1361&amp;ssl=1 1361w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/><\/a><figcaption id=\"figcaption_attachment_1152\" class=\"wp-caption-text\">KDE plots of the convolution response (pre-ReLU) for multiple architectures. Red and orange: two layers of ResNet; green: Inception-v4; blue: our network; black dashed: a true Gaussian distribution for reference.<\/figcaption><\/figure>\n<p>This shows the pre-BN activations from MalConv (blue) and from ResNet (red &amp; orange) and Inception-v4 (green). The purpose of BatchNorm is to output values that follow a standard normal distribution, and it implicitly expects inputs that are relatively close to that. What we suspect is happening is that the input values from other networks aren't Gaussian, but they're close-ish.
((I'd love to be able to quantify that closeness, but none of the tests for normality I'm aware of apply when you have this many samples. If anyone knows of a more robust test, please let me know.)) The input values for MalConv display huge asperity, and aren't even unimodal. If BatchNorm is being wonky for you, I'd suggest plotting the pre-BN activations and checking to see that they're relatively smooth and unimodal.<\/p>\n<h3>3. The Lump of Regularization Fallacy<\/h3>\n<p>If you're overfitting, you probably need more regularization. Simple advice, and easily executed. Every time I see this brought up though, people treat regularization as if it's this monolithic thing. Implicitly, people are talking as if you have some pile of regularization, and if you need to fight overfitting then you just shovel more regularization on top. It doesn't matter what kind, just add more.<\/p>\n<p>We ran into overfitting problems and tried every method we could think of: weight decay, dropout, regional dropout, gradient noise, activation noise, and on and on. The only one that had any impact was <a href=\"https:\/\/arxiv.org\/abs\/1511.06068\">DeCov<\/a>, which penalizes activations in the penultimate layer that are highly correlated with each other. I have no idea what will work on your data \u2014 especially if it's not images\/speech\/text \u2014 so try different types. Don't just treat regularization as a single knob that you crank up or down.<\/p>\n<p>I hope some of these lessons are helpful to you if you're into cybersecurity, or pushing machine learning into new domains in general.
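As a postscript on DeCov: the penalty comes down to just a few lines. This numpy sketch is my reading of the formula in the linked Cogswell et al. paper (half the squared Frobenius norm of the batch covariance matrix, minus its diagonal), not code from our experiments:

```python
import numpy as np

def decov_loss(h):
    """DeCov penalty on a (batch, features) activation matrix.

    Penalizes the off-diagonal entries of the batch covariance
    matrix, pushing hidden features to be decorrelated:
        0.5 * (||C||_F^2 - ||diag(C)||_2^2)
    """
    h = np.asarray(h, dtype=np.float64)
    centered = h - h.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / h.shape[0]    # (d, d) batch covariance
    fro_sq = np.sum(cov ** 2)                   # ||C||_F^2
    diag_sq = np.sum(np.diag(cov) ** 2)         # ||diag(C)||_2^2
    return 0.5 * (fro_sq - diag_sq)             # off-diagonal energy only
```

In training you'd add this (times a small weight) to the classification loss, computed on the penultimate layer's activations; redundant, duplicated features get punished while independent ones cost nothing.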
We'll be presenting the paper this is all based on at the <a href=\"http:\/\/www-personal.umich.edu\/~arunesh\/AICS2018\/\">Artificial Intelligence for Cyber Security<\/a> (AICS)\u00a0workshop at AAAI in February, so if you're at AAAI then stop by and talk.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I don't usually write up my technical work here, mostly because I spend enough hours as is doing technical writing. But a co-author, Jon Barker, recently wrote\u00a0a post on the NVIDIA Parallel For All blog about one of our papers &hellip; <a href=\"https:\/\/www.jsylvest.com\/blog\/2017\/12\/malconv\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[10],"tags":[24,3,34,40,38,18,20],"class_list":["post-1144","post","type-post","status-publish","format-standard","hentry","category-cs","tag-ai","tag-computer-science","tag-machine-learning","tag-ml","tag-neural-nets","tag-projects","tag-science","wpautop"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/s3sddF-malconv","jetpack-related-posts":[{"id":1126,"url":"https:\/\/www.jsylvest.com\/blog\/2017\/11\/ais-one-trick-pony-has-a-hell-of-a-trick\/","url_meta":{"origin":1144,"position":0},"title":"AI's \"one trick pony\" has a hell of 
a trick","author":"jsylvest","date":"10 November 2017","format":false,"excerpt":"The MIT Technology Review has a recent article by James Somers about error backpropagation, \"Is AI Riding a One-Trick Pony?\" Overall, I agree with the message in the article. We need to keep thinking of new paradigms because the SotA right now is very useful, but not correct in any\u2026","rel":"","context":"In &quot;CS \/ Science \/ Tech \/ Coding&quot;","block_context":{"text":"CS \/ Science \/ Tech \/ Coding","link":"https:\/\/www.jsylvest.com\/blog\/category\/cs\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1564,"url":"https:\/\/www.jsylvest.com\/blog\/2019\/06\/bart-barrage-of-random-transforms-for-adversarially-robust-defense\/","url_meta":{"origin":1144,"position":1},"title":"BaRT: Barrage of Random Transforms for Adversarially Robust Defense","author":"jsylvest","date":"19 June 2019","format":false,"excerpt":"This week I'm at CVPR \u2014 the IEEE's Computer Vision and Pattern Recognition Conference, which is a huge AI event. I'm currently rehearsing the timing of my talk one last time, but I wanted to take a minute between run-throughs to link to my co-author Steven Forsyth's wonderful post on\u2026","rel":"","context":"In &quot;CS \/ Science \/ Tech \/ Coding&quot;","block_context":{"text":"CS \/ Science \/ Tech \/ Coding","link":"https:\/\/www.jsylvest.com\/blog\/category\/cs\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.jsylvest.com\/blog\/wp-content\/uploads\/2019\/06\/img1_trans05_0000_crop.png?fit=436%2C436&ssl=1&resize=350%2C200","width":350,"height":200},"classes":[]},{"id":1043,"url":"https:\/\/www.jsylvest.com\/blog\/2017\/04\/will-ai-steal-our-jobs\/","url_meta":{"origin":1144,"position":2},"title":"Will AI steal our jobs?","author":"jsylvest","date":"5 April 2017","format":false,"excerpt":"As an AI researcher, I think I am required to have an opinion about this. Here's what I have to say to the various tribes. 
AI-pessimists: please remember that the Luddites have been wrong about technology causing\u00a0economic cataclysm\u00a0every time so far. We're talking about several consecutive centuries of wrongness. ((I\u2026","rel":"","context":"In &quot;Business \/ Economics&quot;","block_context":{"text":"Business \/ Economics","link":"https:\/\/www.jsylvest.com\/blog\/category\/business-2\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1124,"url":"https:\/\/www.jsylvest.com\/blog\/2017\/10\/national-ai-strategy\/","url_meta":{"origin":1144,"position":3},"title":"National AI Strategy","author":"jsylvest","date":"9 October 2017","format":false,"excerpt":"Some of my co-workers published a sponsored piece in the Atlantic calling for a national AI strategy,\u00a0which was tied in to\u00a0some discussions at the\u00a0Washington Ideas event. I'm 100% on board with the US having a strategy, but I want to offer one caveat: \"comprehensive national strategies\" are susceptible to becoming\u2026","rel":"","context":"In &quot;Business \/ Economics&quot;","block_context":{"text":"Business \/ Economics","link":"https:\/\/www.jsylvest.com\/blog\/category\/business-2\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":95,"url":"https:\/\/www.jsylvest.com\/blog\/2013\/02\/remembering-armen-alchian\/","url_meta":{"origin":1144,"position":4},"title":"Armen Alchian &#038; Unnecessary Mathematical Fireworks","author":"jsylvest","date":"27 February 2013","format":false,"excerpt":"Cato Daily Podcast :: Remembering Armen Alchian Don Boudreaux discussing Armen Alchian's preference for clear prose over \"mathematical pyrotechnics\" reminded me of a few neural networks researchers I know. I won't name names, because it wasn't a favorable comparison. 
There's far too much equation-based whizz-bangery going on in some papers.\u2026","rel":"","context":"In &quot;Business \/ Economics&quot;","block_context":{"text":"Business \/ Economics","link":"https:\/\/www.jsylvest.com\/blog\/category\/business-2\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1188,"url":"https:\/\/www.jsylvest.com\/blog\/2018\/02\/aies-2018\/","url_meta":{"origin":1144,"position":5},"title":"AIES 2018","author":"jsylvest","date":"9 February 2018","format":false,"excerpt":"Last week I attended the first annual conference on AI, Ethics & Society where I presented some work on a Decision Tree\/Random Forest algorithm that makes decisions that are less biased or discriminatory. ((In the colloquial rather than technical sense)) You can read all the juicy details in our paper.\u2026","rel":"","context":"In &quot;CS \/ Science \/ Tech \/ Coding&quot;","block_context":{"text":"CS \/ Science \/ Tech \/ Coding","link":"https:\/\/www.jsylvest.com\/blog\/category\/cs\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/www.jsylvest.com\/blog\/wp-json\/wp\/v2\/posts\/1144","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.jsylvest.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.jsylvest.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.jsylvest.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.jsylvest.com\/blog\/wp-json\/wp\/v2\/comments?post=1144"}],"version-history":[{"count":13,"href":"https:\/\/www.jsylvest.com\/blog\/wp-json\/wp\/v2\/posts\/1144\/revisions"}],"predecessor-version":[{"id":1159,"href":"https:\/\/www.jsylvest.com\/blog\/wp-json\/wp\/v2\/posts\/1144\/revisions\/1159"}],"wp:attachment":[{"href":"https:\/\/www.jsylvest.com\/blog\/wp-json\/wp\/v2\/media?parent=1144"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\
/\/www.jsylvest.com\/blog\/wp-json\/wp\/v2\/categories?post=1144"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.jsylvest.com\/blog\/wp-json\/wp\/v2\/tags?post=1144"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}