Reading through the paper, I'm reminded of the close reading/distant reading paradigms in literary studies. The authors are effectively trying to build a machine learning model that reproduces the authorship attribution task analysts typically perform by hand. I have to give them kudos for the comprehensiveness of their feature engineering and their attention to the numerous traps of authorship attribution (code reuse, multiple authors, etc.).<p>Yet one question that is not really considered is the political stakes of authorship attribution. When you look at the "suspected locations" of the malware authors, it's quite clear that they're mostly located in rogue states. But we also know that some of these attributions can be politically motivated rather than empirically grounded (e.g. the Sony hacks). In the same way that language models reproduce racist/sexist language, this model might thus reproduce geopolitical bias in its authorship attribution.
Authorship style identification in natural language rests on good intuitions that one can work with. However, such a notion in binary executables sounds like total nonsense. The only thing that might make it work is code reuse within a single organization, which is a classic feature to look at in malware detection.<p>So I have no idea what new information this arXiv paper provides, other than introducing an academic topic full of fancy terminology.