Correction: Rowling was 'outed' by her lawyer's wife's friend.<p><a href="http://www.independent.co.uk/arts-entertainment/books/news/jk-rowling-angry-and-disappointed-after-law-firm-leaked-robert-galbraith-pseudonym-8718087.html" rel="nofollow">http://www.independent.co.uk/arts-entertainment/books/news/j...</a><p>If I was the law firm, I'd fire the lawyer.
"I called both of them <i></i>yesterday<i></i> and learned not only how the Rowling investigation worked, but about the fascinating world of forensic linguistics."<p><i>Cringe.</i><p>From my experience (gleaned from dutifully reading every Bitcoin-related article I can get my hands on) I am very wary of reading about any topic which the author admits to just having learnt about <i>yesterday</i>.<p>The majority of the time, unfortunately, English majors aren't the best at understanding technology.
PCA is a pretty neat technique. It's quite old too, invented by Pearson in the early 1900s.<p>Basically, you find a "vector" that travels along the direction of the data with the highest variance. Then you find an orthogonal vector that travels along the direction with the next highest variance, and so on.<p>You then have a set of vectors that explain all of the variance, that aren't correlated (because they're orthogonal), and are ranked by how much they explain.<p>This can be useful in regression to get rid of correlated variables, or you can drop some of the low-variance components if there are more columns than rows, which breaks OLS regression.<p>Consider a new town that you want to get to know as quickly as possible. What is the best method? You start with the longest street, then take a left and travel the next longest street, and so on. You can get a pretty good idea about the town without seeing it all.
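For anyone who wants to see the "highest variance first, then the next orthogonal direction" idea in code, here is a minimal sketch. It is not how the article's tools work and not a production PCA: it assumes small dense data, uses plain power iteration with deflation on the covariance matrix, and all function names (`mean_center`, `covariance`, `top_eigenvector`, `pca`) are made up for this example.

```python
import math


def mean_center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def top_eigenvector(matrix, iterations=500):
    """Power iteration: returns (eigenvalue, unit eigenvector) of the largest
    eigenvalue. Assumption: the start vector is not exactly orthogonal to it."""
    dims = len(matrix)
    v = [1.0 / (j + 1) for j in range(dims)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(dims)) for i in range(dims)]
    eigenvalue = sum(v[i] * mv[i] for i in range(dims))
    return eigenvalue, v


def pca(rows, n_components):
    """Return [(variance_explained, direction), ...] ranked by variance."""
    cov = covariance(mean_center(rows))
    dims = len(cov)
    components = []
    for _ in range(n_components):
        value, vector = top_eigenvector(cov)
        components.append((value, vector))
        # Deflation: remove the direction just found, so the next pass
        # finds the orthogonal direction with the next highest variance.
        cov = [
            [cov[i][j] - value * vector[i] * vector[j] for j in range(dims)]
            for i in range(dims)
        ]
    return components


if __name__ == "__main__":
    # Toy data lying roughly along the line y = 2x.
    data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
    for variance, direction in pca(data, 2):
        print(variance, direction)
```

On the toy data the first direction should point roughly along the line and carry almost all of the variance, which is the "longest street" in the town analogy. A real library would typically use an SVD instead of power iteration.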
The analysis of word length is interesting. English has a lot of long, multi-syllabic Latin-based words, and also a lot of short Germanic-based words. I wonder to what extent a higher percentage of long words indicates a preference for the Latin and vice versa.
Automatic transformation of text to evade these methods seems feasible (running it back and forth through Google Translate might be the crude first attempt). Obviously there might exist more refined methods of identification. With a book it is probably hard not to ruin the text this way, but reviews, posts and such do not require such high standards.
<a href="https://news.ycombinator.com/item?id=3613734" rel="nofollow">https://news.ycombinator.com/item?id=3613734</a><p>This is a tough thing to Google for. Terms I used a few weeks ago:<p>- stylometry<p>- authorship attribution/verification<p>- grammatical analysis, plagiarism detection
Way too many terms like "proof", "fact", "confirmation", "definitely" later on. Doesn't something like this <i>always</i> involve a lot of assumptions and <i>always</i> carry a bias from the samples? Anyone could happen to write like someone else. There is nothing that definitively makes writing different between people the way a fingerprint does (which, as I understand it, is biologically highly random).<p>Analysing sites like HN to see indicators(!) for sockpuppets, or more generally how likely it is that two accounts' writing styles match, would rock!
All of the statistical analyses sound fairly easy to beat.<p>Say you want to pretend to be another author: first build a language model of the target author, then use the model to single out sentences of high perplexity from your writing. Then, have the model "rewrite" your sentences by replacing your words with synonyms that have a higher n-gram probability according to the model. Similar things can be done to fool the character n-gram analyses, or analyses above the word level (e.g., parses).
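A rough sketch of the two steps described above (flag high-perplexity sentences, then swap in synonyms the target model prefers), just to make the idea concrete. Everything here is an assumption for illustration: a tiny add-one-smoothed bigram model stands in for "a language model of the target author", the `BigramModel` and `rewrite` names are invented, and the synonym table is a hypothetical hand-written dict rather than a real thesaurus.

```python
import math
from collections import Counter


def tokenize(text):
    """Very crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class BigramModel:
    """Add-one smoothed bigram model built from the target author's sentences."""

    def __init__(self, sentences):
        self.contexts = Counter()  # how often each word appears as a left context
        self.bigrams = Counter()   # how often each (previous, word) pair appears
        vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + tokenize(sentence)
            vocab.update(tokens)
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab) + 1  # +1 slot for unseen words

    def prob(self, prev, word):
        """P(word | prev) with add-one smoothing."""
        return (self.bigrams[(prev, word)] + 1) / (self.contexts[prev] + self.vocab_size)

    def perplexity(self, sentence):
        """Higher means the sentence looks less like the target author."""
        tokens = ["<s>"] + tokenize(sentence)
        pairs = list(zip(tokens, tokens[1:]))
        if not pairs:
            return float("inf")
        log_prob = sum(math.log(self.prob(p, w)) for p, w in pairs)
        return math.exp(-log_prob / len(pairs))


def rewrite(model, sentence, synonyms):
    """Greedily replace each word with whichever candidate (the word itself or
    one of its synonyms) the model finds most likely after the previous word."""
    prev, out = "<s>", []
    for word in tokenize(sentence):
        candidates = [word] + synonyms.get(word, [])
        best = max(candidates, key=lambda c: model.prob(prev, c))
        out.append(best)
        prev = best
    return " ".join(out)


if __name__ == "__main__":
    target = ["the wizard spoke quietly", "the wizard walked slowly", "the old wizard smiled"]
    mine = ["the sorcerer spoke quietly", "a sorcerer talked loudly"]
    model = BigramModel(target)
    hypothetical_synonyms = {"sorcerer": ["wizard"], "talked": ["spoke"], "loudly": ["quietly"]}
    # Worst-fitting sentences first, each followed by its rewritten version.
    for sentence in sorted(mine, key=model.perplexity, reverse=True):
        print(sentence, "->", rewrite(model, sentence, hypothetical_synonyms))
```

The greedy left-to-right substitution is the simplest thing that could work; a serious attempt would need a much larger model and some check that the rewritten sentence still means the same thing.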
Have there been any instances where "Forensic Linguistics" actually predicted an outcome that wasn't previously suspected and it turned out to be true? All of the examples I've heard of are it "confirming" things already suspected by other means.<p>Either way it is still an interesting tool and a cool use of technology, but I'd be a lot more impressed if the software were fed the text of a large number of random books and it detected an instance (with very high likelihood) of some famous author writing under a pen name, and then had that confirmed.
Something similar could probably be done with code (if it hasn't been done already). I suppose auto-formatting and checkstyles might mute some things, but I imagine you could still get a read from things like variable names, class names, function length, etc.
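As a toy illustration of what "getting a read" from code might look like, here is a sketch that pulls a few style features out of Python source with the standard `ast` module. The feature set (mean identifier length, share of names containing an underscore, mean function length in lines) and the `style_features` name are assumptions picked for this example, not an established code-stylometry method.

```python
import ast
from statistics import mean


def style_features(source):
    """Extract a few crude style features from Python source code."""
    tree = ast.parse(source)
    names = set()
    function_lengths = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            names.add(node.id)        # variable references and assignments
        elif isinstance(node, ast.arg):
            names.add(node.arg)       # function parameters
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names.add(node.name)
            # end_lineno is available on Python 3.8 and later
            function_lengths.append(node.end_lineno - node.lineno + 1)
        elif isinstance(node, ast.ClassDef):
            names.add(node.name)
    return {
        "mean_name_length": mean(len(n) for n in names) if names else 0.0,
        "underscore_share": (
            sum(1 for n in names if "_" in n) / len(names) if names else 0.0
        ),
        "mean_function_length": mean(function_lengths) if function_lengths else 0.0,
    }


if __name__ == "__main__":
    sample = (
        "def add_numbers(first, second):\n"
        "    total = first + second\n"
        "    return total\n"
    )
    print(style_features(sample))
```

Because the features come from the parsed tree rather than the raw text, whitespace and most auto-formatting would not change them, which is the point raised above about formatters muting some signals but not all.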
I'm curious about the ethics of this. Why is it OK to 'out' someone as the author of a book, but it's not OK to 'out' someone as gay?
I don't buy this 'outed' story for one second.<p>This is either marketing or fear of the public reception of her non-Potter book (imagine the pressure she must be under). Either way, this is crap.