Lead author here. Since my serious thinking on this topic started when I responded to this Ask HN post[1] π years ago[2], it's nice to see this posted here, to come full circle in a sense. Happy to answer any questions.

[1] http://news.ycombinator.com/item?id=413730

[2] No, really, it's been exactly π years to the day :-)
Reading this, it occurred to me that it's perhaps worth mentioning that this is how the Unabomber (Ted Kaczynski), the Luddite who carried out a mail bombing campaign spanning nearly 20 years, was caught:

"Before the publication of the manifesto, Theodore Kaczynski's brother, David Kaczynski, was encouraged by his wife Linda to follow up on suspicions that Ted was the Unabomber. David Kaczynski was at first dismissive, but progressively began to take the likelihood more seriously after reading the manifesto a week after it was published in September 1995. David Kaczynski browsed through old family papers and found letters dating back to the 1970s written by Ted and sent to newspapers protesting the abuses of technology and which contained phrasing similar to what was found in the Unabomber Manifesto."

http://en.wikipedia.org/wiki/Ted_Kaczynski#Search
While impressive, I don't think these results are actually that bad for privacy. 80% precision, for example, is useless when you're matching against tens of millions. It's much the same fallacy as the medical test for a disease that occurs in 1 out of 1000 people and is 99% accurate -- out of every 1000 people tested you expect about 1 true positive and 10 false positives, so roughly 90% of the positive results are wrong (a quick sketch of the arithmetic is below).

It reminds me of the claims of being able to identify, for example, the gender of an author with ~65% accuracy -- which is really completely unimpressive, as it's hardly better than guessing, and certainly not something you could rely on for any serious purpose.

The author mentions that topic is one way to help correlate beyond the results of the algorithm. But if I wrote "anonymous" posts in my area of expertise, you certainly would not need stylistic analysis to guess what my identity might be! There has never been privacy in this regard, I don't think.

Where privacy is needed most, I think, is exactly where this deanonymizing tool still isn't sufficient: talking about *unrelated topics*. A person should be free to express themselves under multiple names for different purposes, and there is no reason why an employer needs to know about a programmer's side hobby as a fiction writer if s/he doesn't want them to.

Finally, I do wonder how well these results carry over to the case where someone is *intentionally* operating under a different name. Matching one post by tech blogger A against blogger A is easy, because tech blogger A is making no attempt to write any differently or in any different context. However, what if tech blogger A ghost-wrote YA fiction on the side? Could you use these techniques to detect that the fiction was written by that blogger? It can't be ruled out without trying, but generalizing these results to that case seems questionable.
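To make the base-rate arithmetic concrete, here's a minimal sketch (my own toy numbers applying Bayes' rule, not anything from the paper):

```python
# Toy base-rate calculation: a 99%-accurate test for a condition
# that affects 1 in 1000 people.
prevalence = 1 / 1000      # P(condition)
sensitivity = 0.99         # P(positive | condition)
specificity = 0.99         # P(negative | no condition)

# Expected outcomes per 1000 people tested
true_positives = 1000 * prevalence * sensitivity                 # ~1
false_positives = 1000 * (1 - prevalence) * (1 - specificity)    # ~10

# Fraction of positive results that are wrong
fdr = false_positives / (true_positives + false_positives)
print(f"~{fdr:.0%} of positives are false")  # prints ~91%
```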
The *difficulty* of doing it cross-context is actually slightly more surprising to me than the possibility. I would've guessed that, once a suitable data set was found (a main impediment to previous studies), accuracy would be quite good, along the lines of how easy it is to identify a browser from a few dozen telltale fingerprint markers. But it appears that only about 10% of authors can be matched at 80% precision, which still leaves pretty decent odds of not being identified automatically, at least for now, even without actively trying to cover up (though the linked post is right that with a specific target, intelligently adding some ad-hoc features can probably help).

One thing that'd be interesting to me is whether there are certain characteristics that make it particularly easy to identify people cross-context, like a top-10-telltale-markers sort of thing. Are a disproportionate number of the 10% who can be identified with high precision flagged by a handful of unusual grammatical or lexical features, or is it more of a diffuse effect? (A rough sketch of how one might check is below.)
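On the telltale-markers question, here's a hypothetical sketch of how one might probe it with a linear classifier; the data, labels, and feature names below are placeholders, not the study's actual features:

```python
# Rank stylometric features by weight per author: a few dominant weights
# would suggest telltale markers, a flat profile a more diffuse signal.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 50))        # placeholder: 200 docs x 50 style features
y = rng.integers(0, 10, 200)     # placeholder: 10 authors
feature_names = [f"feature_{i}" for i in range(50)]  # placeholder names

clf = LogisticRegression(max_iter=1000).fit(X, y)

for author, coefs in enumerate(clf.coef_):
    top = np.argsort(-np.abs(coefs))[:10]  # ten highest-|weight| features
    print(author, [feature_names[i] for i in top])
```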
Funny, I just started reading about adversarial stylometry the other day. https://www.cs.drexel.edu/~mb553/stuff/Indiana_20110407.pdf
That's a very interesting paper (and very accessible to anyone with a stats/data-mining background). I went back and read Jason Baldridge's intro, which is excellent:

http://ata-s12.utcompling.com/schedule/ATA-Authorship%20Attribution.pdf?attredirects=0

It seems you didn't attempt to fingerprint misspellings, among the variables on PDF p. 5. Also, I'm curious why you needed to round the dataset up to exactly 100k with the extra 5.7k.
Location can (sometimes) also be detected from writing style:

http://www.cmu.edu/news/archive/2011/January/jan7_twitterdialects.shtml
The privacy implications are a bit worrisome. Perhaps it's time to write utilities to anonymize your writing style.

Maybe running your text through a round-trip translator could help? (Although then you'd need to fix any errors it introduces.) A rough sketch is below.
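Here's a minimal sketch of the round-trip idea; I'm assuming the third-party deep_translator package purely for illustration, but any translation API would do:

```python
# Round-trip translation to perturb writing style: English -> German -> English.
# Assumes the deep_translator package (pip install deep_translator).
from deep_translator import GoogleTranslator

def round_trip(text, pivot="de"):
    # Translate into a pivot language, then back into English.
    intermediate = GoogleTranslator(source="en", target=pivot).translate(text)
    return GoogleTranslator(source=pivot, target="en").translate(intermediate)

print(round_trip("I would have guessed that accuracy would be quite good."))
# Output typically preserves meaning but shuffles word choice and syntax;
# you'd still want to proofread for introduced errors.
```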
I know that I semi-consciously engage in a few spelling anachronisms that probably serve to single me out. Actually, since I recognized both them and their likely effect, I've become somewhat more deliberate in applying them -- or in checking for them while proofreading and deciding whether to leave them in.
"<i>Developing fully automated methods to hide traces of one’s writing style remains a challenge</i>". How would the following 3 methods fare?<p>Method1: Run the text through a markov chain constructed maybe from a mixture of 0.5 your text, 0.25 Shakespeare and 0.25 Alice in wonderland. Do something like sample every third word with the other two coming as a chain. Then run that text through wordnet to do synonym based replacement.<p>Method 2: Do a translation to a nearby language and back again using some language translating api.<p>Method 3: Replace less common words with hypernyms and more common words with synonyms or possibly not + antonyms.<p>Might want a few heuristics to replace stuff like (, ..., ) , - , : ,[,] with each other. Also randomize space between punctuation.<p>Optionally Run the outputs through mechanical turk to iron out the result, leave as is or clean by self.