The project I never got around to was extracting n-grams, making them nodes of a graph, then testing for badness variants using cliques and isomorphisms. Part of the reason I didn't move on it was that the "graphiness" did not appear to add information beyond what could be obtained with other, faster clustering methods and more mature tools, and much smarter people than me were already working on it.<p>Reasoning about the interpreted behaviour of a program as "malware" is a really fun problem. Detecting known malware, variants, work-alikes, and known behaviours, and then intent and "badness", are layers of abstraction that can't be solved bottom-up.<p>Badness is an interpreted reflection of the effect of executed code, so this "missing interpreter" information means there is no pure functional definition or derivation of badness. The best possible solution is a risk calculation based on code reputation, with the error rate covered by some kind of event-driven insurance. (Where you can't have certainty, you can at least have compensation.)<p>IMO, malware analysis is an imperfect-information game that we have spent a lot of time trying to solve as a perfect-information one, and you can see how AV products are running up against these limits now.
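To make that concrete, here is a rough sketch of the pipeline I had in mind, not anything I actually built; the library choice (networkx), the n-gram size, and the top-k cutoff are all placeholders:<p><pre><code>from collections import Counter
from itertools import combinations

import networkx as nx


def byte_ngrams(data, n=8):
    """Yield every overlapping byte n-gram of a blob."""
    return (data[i:i + n] for i in range(len(data) - n + 1))


def build_cooccurrence_graph(samples, n=8, top_k=2000):
    """Nodes are the top_k most frequent n-grams; edges join
    n-grams that co-occur within the same sample."""
    counts = Counter()
    per_sample = []
    for blob in samples:
        grams = set(byte_ngrams(blob, n))
        per_sample.append(grams)
        counts.update(grams)

    keep = {g for g, _ in counts.most_common(top_k)}
    graph = nx.Graph()
    graph.add_nodes_from(keep)
    for grams in per_sample:
        present = sorted(grams & keep)
        graph.add_edges_from(combinations(present, 2))
    return graph


# Maximal cliques shared across known-bad samples would be the candidate
# "badness" structures; subgraph isomorphism between families would then
# flag variants that reuse the same parts in a different arrangement.
# cliques = list(nx.find_cliques(build_cooccurrence_graph(samples)))</code></pre>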
This is certainly clever... but it's not about indexing <i>all</i> n-grams of a large size, only the most-frequent <i>k</i> n-grams.<p>Most of my work with n-grams has been for indexing purposes, meaning you need <i>all</i> the n-grams, not just the most frequent ones.<p>So if I understand correctly... the application here is to take a large number of programs known to be infected with the same malware, and then run this to find the <i>large</i> chunks in common that will be more reliable as a malware signature in the future.<p>That's definitely cool. It kind of feels like a different technique from n-grams in the end, though, since presumably just <i>one</i> chunk indicates a true positive? Whereas with n-grams you're generally looking for a fuzzier statistical match?<p>I'm wondering what other applications this might have besides malware detection.
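If I've read it right, the naive version of that workflow looks something like the sketch below, which counts everything in memory; the paper's contribution (as I understand it) is getting the top-k for large n without this exhaustive count. The file paths and parameters are made up for illustration:<p><pre><code>from collections import Counter
from pathlib import Path


def top_k_ngrams(paths, n=1024, k=10):
    """Return the k byte n-grams that occur in the most samples."""
    doc_freq = Counter()
    for path in paths:
        data = Path(path).read_bytes()
        # Count each n-gram once per sample so a long repeat in one
        # file doesn't dominate the ranking.
        grams = {data[i:i + n] for i in range(len(data) - n + 1)}
        doc_freq.update(grams)
    return doc_freq.most_common(k)


# Hypothetical usage: chunks shared by most samples of one family
# become the candidate signatures.
# candidates = top_k_ngrams(["family_a/s1.bin", "family_a/s2.bin"], n=1024, k=5)</code></pre>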
>Larger values of n are not tested due to computational burden or the fear of overfitting<p>I would say that, many times, larger n-grams are tested in an early stage, give poor results, and are then dropped from subsequent tests.