The project I never got around to was extracting n-grams, making them nodes of a graph, then testing for badness variants using cliques and isomorphisms. Part of the reason I didn't move on it was that the "graphiness" did not appear to add information beyond what could be obtained with other, faster clustering methods and more mature tools, and much smarter people than me were already working on it.<p>Reasoning about the interpreted behaviour of a program as "malware" is a really fun problem. Detecting known malware, variants, work-alikes, and known behaviours, and then intent and "badness", are layers of abstraction that can't be solved bottom-up.<p>Badness is an interpreted reflection of the effect of executed code, so this "missing interpreter" information means there is no pure functional definition or derivation of badness. The best possible solution is a risk calculation based on code reputation, with the error rate covered by some kind of event-driven insurance. (Where you can't have certainty, you can at least have compensation.)<p>IMO, malware analysis is an imperfect-information game that we have spent a lot of time trying to solve as a perfect-information one, and you can see how AV products are running up against these limits now.
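To make that concrete, here is a rough sketch of the pipeline I had in mind, not anything I actually built; the library choice (networkx), the n-gram size, and the top-k cutoff are all placeholders:<p><pre><code>from collections import Counter
from itertools import combinations

import networkx as nx


def byte_ngrams(data, n=8):
    """Yield every overlapping byte n-gram of a blob."""
    return (data[i:i + n] for i in range(len(data) - n + 1))


def build_cooccurrence_graph(samples, n=8, top_k=2000):
    """Nodes are the top_k most frequent n-grams; edges join
    n-grams that co-occur within the same sample."""
    counts = Counter()
    per_sample = []
    for blob in samples:
        grams = set(byte_ngrams(blob, n))
        per_sample.append(grams)
        counts.update(grams)

    keep = {g for g, _ in counts.most_common(top_k)}
    graph = nx.Graph()
    graph.add_nodes_from(keep)
    for grams in per_sample:
        present = sorted(grams & keep)
        graph.add_edges_from(combinations(present, 2))
    return graph


# Maximal cliques shared across known-bad samples would be the candidate
# "badness" structures; subgraph isomorphism between families would then
# flag variants that reuse the same parts in a different arrangement.
# cliques = list(nx.find_cliques(build_cooccurrence_graph(samples)))</code></pre>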
This is certainly clever... but it's not about indexing <i>all</i> n-grams of a large size, only the most-frequent <i>k</i> n-grams.<p>Most of my work with n-grams has been for indexing purposes, meaning you need <i>all</i> the n-grams, not just the most frequent ones.<p>So if I understand correctly... the application here is to take a large number of programs known to be infected with the same malware, and then run this to find the <i>large</i> chunks in common that will be more reliable as a malware signature in the future.<p>That's definitely cool. It kind of feels like a different technique from n-grams in the end, though, since presumably just <i>one</i> chunk indicates a true positive? Whereas with n-grams you're generally looking for a fuzzier statistical match?<p>I'm wondering what other applications this might have besides malware detection.
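If I've read it right, the naive version of that workflow looks something like the sketch below, which counts everything in memory; the paper's contribution (as I understand it) is getting the top-k for large n without this exhaustive count. The file paths and parameters are made up for illustration:<p><pre><code>from collections import Counter
from pathlib import Path


def top_k_ngrams(paths, n=1024, k=10):
    """Return the k byte n-grams that occur in the most samples."""
    doc_freq = Counter()
    for path in paths:
        data = Path(path).read_bytes()
        # Count each n-gram once per sample so a long repeat in one
        # file doesn't dominate the ranking.
        grams = {data[i:i + n] for i in range(len(data) - n + 1)}
        doc_freq.update(grams)
    return doc_freq.most_common(k)


# Hypothetical usage: chunks shared by most samples of one family
# become the candidate signatures.
# candidates = top_k_ngrams(["family_a/s1.bin", "family_a/s2.bin"], n=1024, k=5)</code></pre>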
>Larger values of n are not tested due to computational burden or the fear of overfitting<p>I would say that, many times, larger n-grams are tested in an early stage, give poor results, and are then dropped from subsequent tests.