
Ask HN: Algorithms for text fingerprinting?

92 points by vixsomnis · almost 10 years ago
I remember reading an article a year or so ago about (the NSA) identifying users based on how they write: vocabulary, spelling mistakes, grammar, dialect, and so on.

This is interesting to me because it is extremely difficult to change the vocabulary I use in writing and speaking. Being able to estimate the amount of similarity between two pieces of text would be useful.

The closest things I can think of right now are the proprietary algorithms used to check for plagiarism (at schools and universities, for instance).

Are there any publicly available algorithms for this? Where can I go to learn more? (Academic journals?) Am I just DDGing the wrong search terms?

17 comments

j42 · almost 10 years ago

Figured I'd chime in here, since I recently developed an algorithm that could be applied to this problem with some basic ML.

Basically, the first step would be shingling the text (choosing a sampling domain) and generating a MinHash structure (computationally cheap), which can then be used to find the "similarity" between sets, i.e. the Jaccard index.

If you're clever about this, you can use HyperLogLogs to encode these MinHash structures, gaining a great deal of speed at a marginal error rate, all while allowing for arbitrary N levels of intersection.

If you're looking to build a model to analyze two (or N) text bodies for stylometric similarities, I'd approach the problem in two steps:

1) Minimize the relevant input text.

- Use a Bernoulli/categorical distribution to weight words according to uniqueness; NLP and sentiment-extraction techniques may also help.

- Design a Markov process to represent more complex phrasing patterns of the text as a whole.

- Filter by a variable threshold to reduce the resulting set of shingles/bins/"interesting nodes" to a computationally manageable number.

2) Use an efficient MinHash intersection to compute a similarity score (0-1) for the two texts.

I think given the prevalence of training data (I mean, what's more ubiquitous than the written word...) you could probably tune this to reasonable accuracy and efficient complexity.

Just a 5-minute thought exercise, but if anyone else has ideas I'd be curious as well :)
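A minimal sketch of the shingle-and-MinHash idea above, using only the Python standard library. The word-level 4-shingles and the seeded-MD5 hash family are arbitrary illustrative choices; real implementations use faster hash functions and often libraries like `datasketch`:

```python
import hashlib

def shingles(text, k=4):
    """Split text into overlapping word k-shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """For each salted hash function, keep only the minimum hash value.
    The resulting fixed-size signature is a cheap sketch of the set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The fraction of positions where the minima agree is an unbiased
    estimate of the Jaccard index of the underlying shingle sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

More hash functions shrink the estimator's variance; 64 is a toy setting.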
moyix · almost 10 years ago

The relevant search term is "stylometry". One particular paper I remember is from Dawn Song's group at Berkeley a couple of years back:

http://www.cs.berkeley.edu/~dawnsong/papers/2012%20On%20the%20Feasibility%20of%20Internet-Scale%20Author%20Identification.pdf

There's a lot of public work on the topic, but right now the best place to look is still academic papers (I don't know of any open-source libraries, for example).
thatcat · almost 10 years ago

JStylo might be what you're looking for: https://github.com/psal/jstylo

The same group has also created a text obfuscation tool called Anonymouth that helps you obfuscate your word choices, but it has yet to be released: https://psal.cs.drexel.edu/index.php/JStylo-Anonymouth
benten10 · almost 10 years ago

I may be late to the party, but this is finally my time to shine! As has been mentioned, the field you are looking for is called stylometry; it has almost a century of history behind it, and it is also the field of my thesis. After reading what everyone has been saying, I felt like copying and pasting my 20-page literature review here, but instead I'd recommend you look at the Narayanan et al. (2012) paper on internet-scale authorship attribution. The algorithm it uses is not particularly complex and would take you a week, tops, to implement if you put in a few hours a day, and that includes doing all the related research and catching up on the linear algebra involved if you need to.
gnur · almost 10 years ago

I started a similar project myself recently. I check various parameters (reading-level score, words per sentence, syllables per word, sentences per paragraph, average word length, average syllable count) and calculate the distance between two texts/authors using a simple Euclidean distance.

I started out with the code provided at https://github.com/mac389/ToxTweet/blob/master/textanalyzer.py. I use it in a private project, but the results are promising!
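This feature-vector approach can be sketched in a few lines. The vowel-group syllable counter below is a rough stand-in for a real one, and only three of the features mentioned above are included:

```python
import math
import re

def syllable_count(word):
    """Crude syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def style_features(text):
    """A small vector of the surface features mentioned above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return [
        len(words) / len(sentences),                        # words per sentence
        sum(len(w) for w in words) / len(words),            # avg word length
        sum(syllable_count(w) for w in words) / len(words)  # avg syllables/word
    ]

def style_distance(text_a, text_b):
    """Plain Euclidean distance between the two feature vectors."""
    fa, fb = style_features(text_a), style_features(text_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

In practice you would normalize each feature to a comparable scale before taking distances, otherwise words-per-sentence dominates.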
pdpd · almost 10 years ago

I wonder if this is how the authors of TrueCrypt were identified. I remember reading something similar about coding style.

I am sure the TrueCrypt authors contributed to more than one project.
MalcolmDiggs · almost 10 years ago

When I was in college we turned in papers via "Turnitin", which checked for plagiarism and uniqueness, etc.

There's an interesting research paper about their algorithms here: https://www.cs.auckland.ac.nz/courses/compsci725s2c/archive/termpapers/jrotzky.pdf

And if you search for "Turnitin plagiarism algorithm" I'm sure you'll find a few more resources.
nodelessness · almost 10 years ago

J.K. Rowling was once found to be writing under a pseudonym[1] using this program:

http://evllabs.com/jgaap/w/index.php/Main_Page

[1] http://blogs.wsj.com/speakeasy/2013/07/16/the-science-that-uncovered-j-k-rowlings-literary-hocus-pocus/
MasterScrat · almost 10 years ago

JGAAP is pretty awesome; it has both a Java API and a GUI: https://github.com/evllabs/JGAAP

JStylo, which was already mentioned, is based on JGAAP. You can find some more tools here: http://evllabs.com/jgaap/w/index.php/FAQ#What_other_tools_are_out_there.3F
gull · almost 10 years ago

Check this out: http://www.secretlifeofpronouns.com/exercises.php
amazing_jose · almost 10 years ago

The results could be horrible, but I can imagine a simple technique for hiding all those clues: just send the text to Google Translate, translate it to an intermediate language, and then back to the original one. I can guarantee an excellent rinse and clean. Change the intermediate language and you will change the features of the final text. Of course, you risk horrible semantic changes in the final text ;)

UPDATE: fixed typos.
CSDude · almost 10 years ago

There is a public-domain fingerprinting technique that is used in MOSS (Measure Of Software Similarity). You can get the ideas from there:

http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
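The core of that paper's winnowing scheme can be sketched as follows. This is a simplification: the real algorithm records fingerprint positions and breaks ties by keeping the rightmost minimum, and the choices of k and window size here are arbitrary:

```python
def kgram_hashes(text, k=5):
    """Hash every overlapping character k-gram, whitespace removed
    (so reformatting alone does not change the fingerprints)."""
    s = "".join(text.lower().split())
    return [hash(s[i:i + k]) for i in range(len(s) - k + 1)]

def winnow(hashes, window=4):
    """Winnowing: from each sliding window of hashes, keep the minimum.
    This guarantees any sufficiently long match is detected while
    storing far fewer fingerprints than all k-grams."""
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}

def fingerprint_similarity(text_a, text_b, k=5, window=4):
    """Jaccard overlap of the two fingerprint sets."""
    fa = winnow(kgram_hashes(text_a, k), window)
    fb = winnow(kgram_hashes(text_b, k), window)
    return len(fa & fb) / len(fa | fb)
```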
bolomega10000 · almost 10 years ago

A simple one is based on analyzing stop words. I guess you could do vector similarity on relative stop-word frequencies. You could also try additional features such as word bigrams and trigrams that contain stop words; in other words, things like "all the words the author uses that commonly surround 'of'", to select common stop-word-containing phrases.

There is something about the stop-word usage pattern that makes it harder to forge.

I've never tried this and I don't know much more about it than that, so I strongly suggest you also find papers that treat authorship attribution via stop words.
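A minimal sketch of the stop-word idea. The ten-word stop list and the cosine measure are illustrative choices, not taken from any particular paper; a real experiment would use a full stop list (e.g. NLTK's):

```python
import math
import re

# Tiny illustrative stop list; real work would use a few hundred words.
STOP_WORDS = ["the", "of", "and", "a", "to", "in", "is", "that", "it", "for"]

def stopword_profile(text):
    """Relative frequency of each stop word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return [words.count(w) / len(words) for w in STOP_WORDS]

def cosine(u, v):
    """Cosine similarity between two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Two texts by the same author should then yield profiles with cosine similarity closer to 1 than texts by different authors, at least in expectation.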
fauxfauxpas · almost 10 years ago

Somewhat related, an excerpt from Cryptonomicon:

The percussionist stands up. "Every radio operator has a distinctive style of keying—we call it his fist. With a bit of practice, our Y Service people can recognize different German operators by their fists—we can tell when one of them has been transferred to a different unit, for example."

And this article by Schneier, "Identifying People by Their Writing Style": https://www.schneier.com/blog/archives/2011/08/identifying_peo_2.html
bane · almost 10 years ago

Here's a quick one:

1) Tokenize each text into a bag (set) of words.

2) Compute the Jaccard index[1] using the two sets.

Here's another:

1) Tokenize each text into a multi-bag (multiset) of words, keeping track of token frequency.

2) Keeping the token frequency, order the sets into lists.

3) Map the lists of words onto an n-dimensional space (where n is, say, all of the words in the two documents) as vectors.

4) Compute the cosine similarity[2].

Here's another:

1) Tokenize the texts into two bags of words.

2) Compute the set difference going both ways.

3) Does either difference contain discriminator tokens that rule it out as being from that person?

4) (Optional) Extend to 2-, 3-, n-grams.

Here's another (a variant of the one above):

1) Compute 1-, 2-, 3-, n-grams from one of the texts.

2) Insert the n-grams into a set.

3) Compute the same for the second document and test for set membership.

4) Count the total n-grams from your second document.

5) Compute (not-in-set / total-n-grams) * 100 to yield a "uniqueness" measure.

6) Determine if the second document is "unique" enough.

And another:

1) Assume you have a sample corpus from a writer and want to know if a new text belongs in that corpus.

2) Follow the method above, but for steps 1 and 2 use the entire reference corpus.

And another:

1) Produce an ontology of discriminator terms and categories unique to the writer.

2) Use a named-entity-recognition (NER) tool of some kind to find those terms in each document.

3) Use the set of found terms as an alternative to a bag of words for the Jaccard or vector models above.

You may need to play with stop-word removal, tokenization schemes, and n-gram windows (for example, omitting 1-grams might focus the analysis on phrase usage vs. vocabulary usage).

1 - https://en.wikipedia.org/wiki/Jaccard_index

2 - https://en.wikipedia.org/wiki/Cosine_similarity
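The first two recipes above can be sketched directly in Python (a toy whitespace tokenizer stands in for a real one):

```python
import math
from collections import Counter

def tokenize(text):
    """Toy tokenizer; real pipelines handle punctuation and stop words."""
    return text.lower().split()

def jaccard(text_a, text_b):
    """Recipe 1: bag-of-words sets, then the Jaccard index."""
    a, b = set(tokenize(text_a)), set(tokenize(text_b))
    return len(a & b) / len(a | b)

def cosine_similarity(text_a, text_b):
    """Recipe 2: token-frequency vectors, then cosine similarity.
    Counters act as sparse vectors (missing tokens count as zero)."""
    fa, fb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(fa[t] * fb[t] for t in fa)
    norm = (math.sqrt(sum(v * v for v in fa.values()))
            * math.sqrt(sum(v * v for v in fb.values())))
    return dot / norm if norm else 0.0
```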
espe · almost 10 years ago

AFAIK, stylo (https://sites.google.com/site/computationalstylistics/stylo) is the academic go-to solution. It even sports a GUI.
wodenokoto · almost 10 years ago

You want to look for "authorship attribution" as your keyword.

There are two main approaches to authorship attribution. One is through stylistic markers, where you look for a set of predefined features, such as the average length per paragraph or the number of times "whenever" is used. This is highly language-dependent.

The other is character n-gram analysis. You choose which n you want to harvest n-grams for; your author profile is the frequency of the top 2000 n-grams, and you compare this profile with a document's top 2000 n-grams. The profile with the shortest distance is your match.

Robert Layton has a tutorial and some code on n-gram attribution on GitHub:

* https://github.com/robertlayton/authorship_tutorials

* https://github.com/robertlayton/author-detection

And here's a list of papers I reviewed while doing a similar project.

[1] Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni. Gender, genre, and writing style in formal written texts. 23(3):321–346, 2003.

[2] John F. Burrows. 'An ocean where each kind...': Statistical analysis and some major determinants of literary style. Computers and the Humanities, 23(4-5):309–321, 1989.

[3] Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, and Sokratis Katsikas. Source code author identification based on n-gram author profiles. In Artificial Intelligence Applications and Innovations, pages 508–515. Springer, 2006.

[4] Sheena Gardner and Hilary Nesi. A classification of genre families in university student writing. Applied Linguistics, 34(1):25–52, 2013.

[6] John Houvardas and Efstathios Stamatatos. N-gram feature selection for authorship identification. In Artificial Intelligence: Methodology, Systems, and Applications, pages 77–86. Springer, 2006.

[7] Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006.

[8] Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. N-gram-based author profiles for authorship attribution. In Proceedings of the Conference of the Pacific Association for Computational Linguistics, PACLING, volume 3, pages 255–264, 2003.

[9] Maarten Lambers and Cor J. Veenman. Forensic authorship attribution using compression distances to prototypes. In Computational Forensics, pages 13–24. Springer, 2009.

[11] Fiona J. Tweedie and R. Harald Baayen. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5):323–352, 1998.

[12] Cor J. Veenman and Zhenshi Li. Authorship verification with compression features.

[13] Rong Zheng, Jiexun Li, Hsinchun Chen, and Zan Huang. A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393, 2006.
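The character n-gram profile approach described in that comment can be sketched briefly. Note the set-overlap distance below is a simplification; profile-based papers such as Kešelj et al. compare relative frequencies of the shared n-grams, not just their presence:

```python
from collections import Counter

def char_ngram_profile(text, n=3, top=2000):
    """Author profile: the most frequent character n-grams and their counts."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return dict(grams.most_common(top))

def profile_distance(p, q):
    """Simplified dissimilarity: 1 minus the Jaccard overlap of the
    two top-n-gram sets. Zero means identical profiles."""
    a, b = set(p), set(q)
    return 1 - len(a & b) / len(a | b)

def attribute(document, author_profiles, n=3, top=2000):
    """Assign the document to the author whose profile is nearest."""
    doc = char_ngram_profile(document, n, top)
    return min(author_profiles, key=lambda name: profile_distance(doc, author_profiles[name]))
```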