How Forensic Linguistics Identified J.K. Rowling

135 points by ahmadss almost 12 years ago

16 comments

junto almost 12 years ago

Correction: Rowling was 'outed' by her lawyer's wife's friend. <a href="http://www.independent.co.uk/arts-entertainment/books/news/jk-rowling-angry-and-disappointed-after-law-firm-leaked-robert-galbraith-pseudonym-8718087.html" rel="nofollow">http://www.independent.co.uk/arts-entertainment/books/news/j...</a> If I were the law firm, I'd fire the lawyer.

GigabyteCoin almost 12 years ago

"I called both of them yesterday and learned not only how the Rowling investigation worked, but about the fascinating world of forensic linguistics."

Cringe.

From my experience (gleaned from dutifully reading every Bitcoin-related article I can get my hands on) I am very wary of reading about any topic which the author admits to just having learnt about yesterday.

The majority of the time, unfortunately, English majors aren't the best at understanding technology.

elchief almost 12 years ago

PCA is a pretty neat technique. It's quite old too, invented by Pearson in the early 1900s.

Basically, you find a "vector" that travels along the part of the data with the highest variance. Then you find an orthogonal vector that travels along the part with the next highest variance.

You then have a set of vectors that explain all of the variance, that aren't correlated (because they're orthogonal), and that are ranked by how much they explain.

This can be useful in regression to get rid of correlated variables, or you can drop some of the low-variance components when there are more columns than rows, a situation that breaks OLS regression.

Consider a new town that you want to get to know as quickly as possible. What is the best method? You start with the longest street, then take a left and travel the next longest street, and so on. You can get a pretty good idea about the town without seeing it all.
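
For anyone who wants to poke at this, here is a minimal sketch of the idea in plain Python (not from the article). It finds the directions by power iteration with deflation, which is one simple way to get the top eigenvectors of the covariance matrix; real libraries typically use an SVD or a symmetric eigensolver instead. The function name `pca` and the synthetic two-column data are made up for illustration.

```python
import random

def pca(rows, k):
    """Top-k principal components of `rows` via power iteration with deflation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(0)
    components, variances = [], []
    for _ in range(k):
        # Start from a random direction and repeatedly multiply by cov;
        # the vector converges to the direction of highest remaining variance.
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(200):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        # Variance explained along v (the Rayleigh quotient v' cov v).
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        components.append(v)
        variances.append(lam)
        # Deflate: remove the variance already explained along v, so the
        # next pass finds the best direction orthogonal to it.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return components, variances

# Synthetic data (an assumption for the demo): two noisy copies of one signal,
# so almost all of the variance lies along the diagonal.
data_rng = random.Random(1)
data = []
for _ in range(300):
    t = data_rng.gauss(0, 1)
    data.append([t + data_rng.gauss(0, 0.1), t + data_rng.gauss(0, 0.1)])

components, variances = pca(data, 2)
explained = [v / sum(variances) for v in variances]
```

Here `explained[0]` is the share of variance carried by the "longest street"; dropping the second component would lose very little in this toy data set.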

3minus1 almost 12 years ago

The analysis of word length is interesting. English has a lot of long, multi-syllabic Latin-based words, and also a lot of short Germanic-based words. I wonder to what extent a higher percentage of long words indicates a preference for the Latin vocabulary, and vice versa.
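
A rough sketch of how a word-length profile could be computed, in Python. This only measures length; it does not look up etymology, so it is at best a crude proxy for the Latin/Germanic split. The function names, the seven-letter threshold, and the two sample sentences are all invented for illustration.

```python
import re
from collections import Counter

def words_in(text):
    # Very rough tokenization: runs of ASCII letters only.
    return re.findall(r"[A-Za-z]+", text)

def word_length_profile(text):
    """Relative frequency of each word length in `text`."""
    counts = Counter(len(w) for w in words_in(text))
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}

def long_word_share(text, min_len=7):
    """Fraction of words with at least `min_len` letters (threshold is arbitrary)."""
    words = words_in(text)
    return sum(len(w) >= min_len for w in words) / len(words)

plain = "The old man went down to the sea and sat by his boat"
ornate = ("The individual subsequently relocated, contemplating "
          "considerable maritime difficulties")

plain_share = long_word_share(plain)
ornate_share = long_word_share(ornate)
```

Comparing `plain_share` and `ornate_share` across two authors' texts would give one of the simplest possible stylometric features.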

praptak almost 12 years ago

Automatic transformation of text to evade these methods seems feasible (running it back and forth through Google Translate might be a crude first attempt). Obviously, more refined methods of identification might exist. In the case of a book it is probably hard not to ruin it this way, but reviews, posts and such do not require such high standards.

gtani almost 12 years ago

<a href="https://news.ycombinator.com/item?id=3613734" rel="nofollow">https://news.ycombinator.com/item?id=3613734</a>

This is a tough thing to google for. Terms I used a few weeks ago:

- stylometry
- authorship attribution/verification
- grammatical analysis, plagiarism detection

hnha almost 12 years ago

Way too many terms like "proof", "fact", "confirmation", "definitely" later on. Isn't something like this always based on a lot of assumptions, and always biased by the samples? Anyone could happen to write like someone else. There is nothing that definitively distinguishes one person's writing from another's the way a fingerprint does (which, as I understand it, is biologically highly random).

Analysing sites like HN to see indicators(!) of sockpuppets, or more generally how likely it is that accounts' writing styles are correlated, would rock!

cliveowen almost 12 years ago

Is that really a website that uses a normally sized font and doesn't drown me with ads? Nah, I must be dreaming.

fortepianissimo almost 12 years ago

All of the statistical analyses sound fairly easy to beat.

Say you want to pretend to be another author: first build a language model of the target author, then use the model to single out sentences of high perplexity from your writing. Then, have the model "rewrite" your sentences by replacing your words with synonyms of higher n-gram probabilities according to the model. Similar things can be done to fool the character n-gram analyses, or analyses above words (e.g., parses).
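
A minimal sketch of the first step only (flagging high-perplexity sentences), assuming a word bigram model with add-one smoothing stands in for "a language model of the target author". The class name `BigramModel`, the tokenizer, and the tiny corpora are invented for illustration; the synonym-rewriting step is not attempted here.

```python
import math
import re
from collections import Counter

def tokens(sentence):
    # Lowercased words, wrapped in sentence boundary markers.
    return ["<s>"] + re.findall(r"[a-z']+", sentence.lower()) + ["</s>"]

class BigramModel:
    """Word bigram model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for s in sentences:
            toks = tokens(s)
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        # One extra slot so unseen words still get some probability mass.
        self.vocab_size = len(self.unigrams) + 1

    def prob(self, prev, word):
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)

    def perplexity(self, sentence):
        toks = tokens(sentence)
        log_p = sum(math.log(self.prob(a, b)) for a, b in zip(toks, toks[1:]))
        # Perplexity = exp of the average negative log-probability per bigram.
        return math.exp(-log_p / (len(toks) - 1))

# Toy stand-ins for the target author's corpus and for your own draft.
target_corpus = [
    "the wizard raised his wand",
    "the wizard walked to the castle",
    "his wand was in the castle",
]
my_sentences = [
    "the wizard walked to the castle",
    "quarterly revenue exceeded projections",
]

model = BigramModel(target_corpus)
# Sentences the target author's model finds most surprising come first.
ranked = sorted(my_sentences, key=model.perplexity, reverse=True)
```

The sentences at the front of `ranked` are the ones least like the target author, so they would be the first candidates for rewriting.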

waterlesscloud almost 12 years ago

Pretty cool. Interesting too, since Rowling is probably the most imitated author in the world at the moment. I guess not by published authors, though.

georgemcbay almost 12 years ago

Have there been any instances where "Forensic Linguistics" actually predicted an outcome that wasn't previously suspected and that turned out to be true? All of the examples I've heard of involve it "confirming" things already suspected by other means.

Either way, it is still an interesting tool and a cool use of technology, but I'd be a lot more impressed if the software were fed the text of a large number of random books, detected (with very high likelihood) an instance of some famous author writing under a pen name, and then had that confirmed.

Nycto almost 12 years ago

Something similar could probably be done with code (if it hasn't been done already). I suppose auto-formatting and checkstyles might mute some things, but I imagine you could still get a read from things like variable names, class names, function length, etc.
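
As a rough illustration of what such a "read" might look like, here is a sketch that pulls a few crude style features out of Python source using the standard `ast` module. The feature set, the function name `style_features`, and the two sample snippets are invented for illustration; a real attribution system would need far more features and a lot of training data.

```python
import ast

def style_features(source):
    """Crude style features for a piece of Python source code."""
    tree = ast.parse(source)
    # Identifiers used in expressions and assignments (not parameter names).
    names = [n.id for n in ast.walk(tree) if isinstance(n, ast.Name)]
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    # Lines spanned by each function definition (end_lineno needs Python 3.8+).
    func_lengths = [f.end_lineno - f.lineno + 1 for f in funcs]
    return {
        "avg_name_length": sum(map(len, names)) / len(names) if names else 0.0,
        "underscore_name_share": sum("_" in n for n in names) / len(names) if names else 0.0,
        "avg_function_lines": sum(func_lengths) / len(func_lengths) if func_lengths else 0.0,
    }

terse = "def f(a, b):\n    c = a + b\n    return c\n"
verbose = (
    "def add_numbers(first_value, second_value):\n"
    "    running_total = first_value + second_value\n"
    "    return running_total\n"
)

terse_features = style_features(terse)
verbose_features = style_features(verbose)
```

Both snippets do the same thing and would survive an auto-formatter unchanged, yet the naming features still separate them.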

MarkMc almost 12 years ago

I'm curious about the ethics of this. Why is it OK to 'out' someone as the author of a book, but it's not OK to 'out' someone as gay?

brownbat almost 12 years ago

s/b "How Forensic Linguistics Confirmed a Leak about Rowling"

mnglkhn2 almost 12 years ago

At the same time we can think of the whole thing as a smart marketing plot.

alxbrun almost 12 years ago

I don't buy this 'outed' story for one second. This is either marketing or fear of the public reception of her non-Potter book (imagine the pressure she must be under). Either way, this is crap.
