KiloGrams: Large N-Grams for Malware Classification

51 points by adulau almost 6 years ago

4 comments

motohagiography almost 6 years ago
The project I never got around to was extracting n-grams, making them nodes of a graph, then testing for badness variants using cliques and isomorphisms. Part of the reason I didn't move on it was that the "graphiness" did not appear to add information that could be obtained using other, faster clustering methods with more mature tools, and much smarter people than me were working on solving it.

Reasoning about the interpreted behaviour of a program as "malware" is a really fun problem. Detecting known malware, variants, work-alikes, known behaviours, and then intent and "badness" are layers of abstraction that can't be solved bottom-up.

Badness is an interpreted reflection of the effect of executed code, so this "missing interpreter" information means there is no pure functional definition or derivation of badness. The best possible solution is a risk calculation based on code reputation, with the error rate covered by some kind of event-driven insurance. (Where you can't have certainty, you can at least have compensation.)

IMO, malware analysis is an imperfect-information game that we have spent a lot of time trying to solve as a perfect one, and you can see how AV products are running up against these limits now.
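The abandoned idea described above could be sketched roughly as follows. This is a simplified illustration only, not the commenter's actual design: it connects samples that share byte n-grams and clusters them by connected components, rather than running the clique/isomorphism tests mentioned. All names and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def ngrams(data: bytes, n: int) -> set:
    """All distinct length-n byte windows in a sample."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def shared_ngram_graph(samples: dict, n: int, threshold: int) -> dict:
    """Adjacency map connecting two samples when they share at
    least `threshold` distinct n-grams (a crude variant signal)."""
    grams = {name: ngrams(data, n) for name, data in samples.items()}
    edges = defaultdict(set)
    for a, b in combinations(grams, 2):
        if len(grams[a] & grams[b]) >= threshold:
            edges[a].add(b)
            edges[b].add(a)
    return edges

# Hypothetical toy samples: two share a 4-byte run, one is unrelated.
g = shared_ngram_graph(
    {"x": b"abcdef", "y": b"abcdzz", "z": b"qqqqqq"},
    n=4, threshold=1,
)
```

As the comment notes, simpler clustering on shared-gram counts often recovers the same grouping without any graph machinery, which is part of why the graph layer may not add much.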
crazygringo almost 6 years ago
This is certainly clever... but it's not about indexing all n-grams of a large size, only the most-frequent k n-grams.

Most of my work with n-grams has been for indexing purposes, meaning you need all the n-grams, not just the most frequent ones.

So if I understand correctly... the application here is to take a large number of programs known to be infected with the same malware, and then run this to find the large chunks in common that will be more reliable as a malware signature in the future.

That's definitely cool. It kind of feels like a different technique from n-grams in the end, though, since presumably just one chunk indicates a true positive? Whereas with n-grams you're generally looking for a fuzzier statistical match?

I'm wondering what other applications this might have besides malware detection.
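The "most-frequent k n-grams" idea being discussed can be illustrated with a naive sketch. Note this is not the paper's algorithm: KiloGrams scales to very large n via hashing and multiple passes, whereas this toy version materializes every window in memory. The sample bytes and function name are made up for illustration.

```python
from collections import Counter

def top_k_ngrams(data: bytes, n: int, k: int):
    """Count every length-n byte window and return the k most frequent.
    Memory-hungry for large n; only meant to show the idea."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return counts.most_common(k)

# Hypothetical sample: an 8-byte repeated "signature" embedded in filler.
sample = b"\x00\x01" * 4 + b"MALSIG!!" + b"\x02\x03" * 4 + b"MALSIG!!"
print(top_k_ngrams(sample, n=8, k=3))
```

The repeated chunk surfaces at the top of the list, which is the intuition behind using frequent large n-grams as signature candidates.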
EdwardRaff almost 6 years ago
Paper author here, happy to answer questions! I'll try and pop in as I have time in the day :)
woliveirajr almost 6 years ago
> Larger values of n are not tested due to computational burden or the fear of overfitting

I would say that many times larger n-grams are tested in the early stage, give poor results, and then are dropped in subsequent tests.