How Forensic Linguistics Identified J.K. Rowling

135 points by ahmadss almost 12 years ago

16 comments

junto almost 12 years ago

Correction: Rowling was 'outed' by her lawyer's wife's friend. http://www.independent.co.uk/arts-entertainment/books/news/jk-rowling-angry-and-disappointed-after-law-firm-leaked-robert-galbraith-pseudonym-8718087.html

If I were the law firm, I'd fire the lawyer.

GigabyteCoin almost 12 years ago

"I called both of them yesterday and learned not only how the Rowling investigation worked, but about the fascinating world of forensic linguistics."

Cringe.

From my experience (gleaned from dutifully reading every Bitcoin-related article I can get my hands on), I am very wary of reading about any topic that the author admits to having learnt about only yesterday.

The majority of the time, unfortunately, English majors aren't the best at understanding technology.

elchief almost 12 years ago

PCA is a pretty neat technique. It's quite old too, invented by Pearson in the early 1900s.

Basically, you find a "vector" that travels along the part of the data with the highest variance. Then you find an orthogonal vector that travels along the part with the next highest variance, and you keep going until you run out of dimensions.

You then have a set of vectors that explain all of the variance, that aren't correlated (because they're orthogonal), and that are ranked by how much they explain.

This can be useful in regression to get rid of correlated variables, or you can drop some of the low-variance components if there are more columns than rows, a situation that breaks OLS regression.

Consider a new town that you want to get to know as quickly as possible. What is the best method? You start with the longest street, then take a left and travel the next longest street, and so on. You can get a pretty good idea of the town without seeing it all.
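
As a rough illustration of the "highest variance first, then an orthogonal direction" idea above, here is a minimal pure-Python sketch for 2-D data only. The function name `pca_2d` and the sample points are made up for this example, and it uses the closed-form eigendecomposition of a 2x2 covariance matrix rather than whatever a real statistics package would do for higher dimensions.

```python
import math

def pca_2d(points):
    """Return [(variance, (ux, uy)), ...] for 2-D points, largest variance first.

    Assumes at least two points. Each direction is a unit vector.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[a, b], [b, c]] of the centred data.
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    lam1, lam2 = mid + half_gap, mid - half_gap
    # Eigenvector for the larger eigenvalue: the direction of highest variance.
    if abs(b) > 1e-12:
        vx, vy = b, lam1 - a
    elif a >= c:
        vx, vy = 1.0, 0.0  # data already aligned with the x axis
    else:
        vx, vy = 0.0, 1.0  # data already aligned with the y axis
    norm = math.hypot(vx, vy)
    first = (vx / norm, vy / norm)
    # In 2-D the second component is simply the perpendicular direction.
    second = (-first[1], first[0])
    return [(lam1, first), (lam2, second)]

# Invented sample data lying roughly along the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
for variance, direction in pca_2d(points):
    print(f"variance {variance:.3f} along {direction}")
```

The two returned variances add up to the total variance of the data, which is the "explain all of the variance" property mentioned above. Dropping the second entry is the 2-D version of discarding a low-variance component.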

3minus1 almost 12 years ago

The analysis of word length is interesting. English has a lot of long, multi-syllabic Latin-based words, and also a lot of short Germanic-based words. I wonder to what extent a higher percentage of long words indicates a preference for the Latin vocabulary, and vice versa.
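
For a concrete picture of what a word-length analysis might compute, here is a small sketch. The helper names and the seven-letter threshold are assumptions made for this example, not the method used in the article. It measures length only: tying long words to Latin roots would need an etymological dictionary, which is not attempted here.

```python
import re
from collections import Counter

def word_length_profile(text):
    """Return {word length: share of words with that length}."""
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}

def long_word_share(text, threshold=7):
    """Share of words with at least `threshold` letters (threshold is arbitrary)."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    return sum(len(w) >= threshold for w in words) / len(words)

# Two invented sentences with roughly the same meaning.
latinate = "We will commence the examination immediately"
germanic = "We will start the test right now"
print(long_word_share(latinate), long_word_share(germanic))
```

Comparing two profiles built this way, for example from a known text and a disputed one, is the kind of signal the comment above is asking about.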

praptak almost 12 years ago

Automatic transformation of text to evade these methods seems feasible (running it through Google Translate and back might be a crude first attempt). Obviously, more refined methods of identification might exist. In the case of a book it is probably hard not to ruin it this way, but reviews, posts and the like do not require such high standards.

gtani almost 12 years ago

https://news.ycombinator.com/item?id=3613734

This is a tough thing to google for. Terms I used a few weeks ago:

- stylometry
- authorship attribution/verification
- grammatical analysis, plagiarism detection

hnha almost 12 years ago

Way too many terms like "proof", "fact", "confirmation" and "definitely" later on. Doesn't something like this always rest on a lot of assumptions, and always carry a bias from the samples? Anyone could happen to write like someone else. Nothing makes writing definitively different between people the way a fingerprint does (which, as I understand it, is biologically highly random).

Analysing sites like HN to see indicators(!) of sockpuppets, or more generally how likely two accounts' writing styles are to be correlated, would rock!
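
One way such an account-to-account comparison might look, as a minimal sketch: build a function-word frequency profile per account and compare profiles with cosine similarity. The word list, the function names and the sample posts are all invented for this example, and, in the spirit of the comment above, a high score is only an indicator, not proof.

```python
import math
import re
from collections import Counter

# A short, arbitrary list of common English function words (an assumption).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is",
                  "it", "for", "but", "with", "as", "not", "on")

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Invented posts standing in for two accounts' comment histories.
account_a = "It is not the tool that matters but the way that it is used."
account_b = "The tool is fine, but it is not for everyone, and that is OK."
print(cosine_similarity(profile(account_a), profile(account_b)))
```

With real data each profile would be built from many comments per account, and the scores would need a baseline from unrelated accounts before any pair could be called suspicious.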

cliveowen almost 12 years ago

Is that really a website that uses a normally sized font and doesn't drown me in ads?

Nah, I must be dreaming.

fortepianissimo almost 12 years ago

All of the statistical analyses sound fairly easy to beat.

Say you want to pretend to be another author: first build a language model of the target author, then use the model to single out sentences of high perplexity from your own writing. Then have the model "rewrite" those sentences by replacing your words with synonyms that have higher n-gram probabilities according to the model. Similar things can be done to fool the character n-gram analyses, or analyses above the word level (e.g., parses).
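
The first half of that recipe (model the target author, then flag your own high-perplexity sentences) can be sketched as follows. This assumes an add-one-smoothed bigram model as a stand-in for a real language model. The class and function names and the tiny corpus are invented for this example, and the synonym-replacement step is not shown.

```python
import math
import re
from collections import Counter

def tokens(sentence):
    """Lower-cased word tokens wrapped in sentence boundary markers."""
    return ["<s>"] + re.findall(r"[a-z']+", sentence.lower()) + ["</s>"]

class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            toks = tokens(sentence)
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        # One extra slot stands in for every word never seen in training.
        self.vocab_size = len(self.unigrams) + 1

    def perplexity(self, sentence):
        """Per-token perplexity of the sentence under the model."""
        toks = tokens(sentence)
        log_prob = 0.0
        for prev, cur in zip(toks, toks[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(toks) - 1))

def rank_by_perplexity(model, sentences):
    """Sentences ordered from least to most like the modelled author."""
    return sorted(sentences, key=model.perplexity, reverse=True)

# Invented stand-in for the target author's corpus.
target_author = ["the cat sat on the mat", "the dog sat on the rug"]
model = BigramModel(target_author)
mine = ["the dog sat on the mat", "quantum flux destabilises everything"]
for sentence in rank_by_perplexity(model, mine):
    print(sentence)
```

The sentences at the top of the ranking are the ones that would be rewritten first. A usable model would need far more text than this toy corpus.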

waterlesscloud almost 12 years ago

Pretty cool. Interesting too, since Rowling is probably the most imitated author in the world at the moment. I guess not by published authors, though.

georgemcbay almost 12 years ago

Have there been any instances where "Forensic Linguistics" actually predicted an outcome that wasn't previously suspected and that turned out to be true? All of the examples I've heard of involve it "confirming" things already suspected by other means.

Either way it is still an interesting tool and a cool use of technology, but I'd be a lot more impressed if the software were fed the text of a large number of random books, detected an instance (with very high likelihood) of some famous author writing under a pen name, and then had that confirmed.

Nycto almost 12 years ago

Something similar could probably be done with code (if it hasn't been done already). I suppose auto-formatting and checkstyles might mute some things, but I imagine you could still get a read from things like variable names, class names, function length, etc.
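
A minimal sketch of what extracting such features could look like for Python source, using the standard `ast` module. The function name `style_features` and the choice of features are assumptions for this example. It only shows how the signals mentioned above might be read off a file, not a working authorship classifier.

```python
import ast
from statistics import mean

def style_features(source):
    """Extract a few crude style signals from Python source code."""
    tree = ast.parse(source)
    # Distinct names that are assigned to somewhere in the file.
    names = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }
    funcs = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    # Function length in source lines (end_lineno needs Python 3.8+).
    func_lengths = [f.end_lineno - f.lineno + 1 for f in funcs]
    return {
        "function_count": len(funcs),
        "mean_function_length": mean(func_lengths) if func_lengths else 0.0,
        "mean_variable_name_length": mean([len(n) for n in names]) if names else 0.0,
        "underscore_name_share": (
            sum("_" in n for n in names) / len(names) if names else 0.0
        ),
    }

# Invented sample source to analyse.
sample = "def f(a):\n    total_sum = a + 1\n    return total_sum\n"
print(style_features(sample))
```

Because the features come from the syntax tree rather than the raw text, whitespace changes made by an auto-formatter do not affect the naming features, which matches the intuition in the comment above.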

MarkMc almost 12 years ago

I'm curious about the ethics of this. Why is it OK to 'out' someone as the author of a book, but it's not OK to 'out' someone as gay?

brownbat almost 12 years ago

s/b "How Forensic Linguistics Confirmed a Leak about Rowling"

mnglkhn2 almost 12 years ago

At the same time we can think of the whole thing as a smart marketing plot.

alxbrun almost 12 years ago

I don't buy this 'outed' story for one second.

This is either marketing or fear of the public reception of her non-Potter book (imagine the pressure she must be under). Either way, this is crap.
