
Deep Text Correcter

228 points · by atpaino · over 8 years ago

27 comments

jmiserez · over 8 years ago
Interesting idea. I went ahead and tested:

> Alex went to the kitchen to store the milk in the fridge.

Corrected:

> Alex went to the kitchen to the store the milk in the fridge.

Gathering a large, high quality dataset from the internet is probably not so easy. A lot of the content on HN/Reddit/forums is of low quality grammatically and often written by non-native English speakers (such as myself). Movie dialogues don't necessarily consist of grammatically correct sentences like the ones you'd write in a letter. Perhaps there is some public domain contemporary literature available that could be used instead or alongside the dialogues?

EDIT: Unrelated to this project, I have this general fear of language recommendation tools trained on just low-quality comments or emails. A simple thesaurus and a grammar-checker are often enough to find the right words when writing. But a tool that could understand my intent and then propose restructured or similar sentences and words that convey the same meaning could be a true killer application.
daveytea · over 8 years ago
This is really cool. If you're looking for more datasets to train your model, here are a few relevant ones:

- https://archive.org/details/stackexchange
- http://trec.nist.gov/data/qamain.html
- http://opus.lingfil.uu.se/OpenSubtitles2016.php
- http://corpus.byu.edu/full-text/wikipedia.asp or https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
- http://opus.lingfil.uu.se/

I'd love to see how good your model gets.
brandonb · over 8 years ago
Interesting idea! I think this is analogous to the idea of a de-noising autoencoder in computer vision. Here, instead of introducing Gaussian noise at the pixel level and using a CNN, you're introducing grammatical "noise" at the word level and using an LSTM.

I think that general framework applies to many different domains. For example, we trained a denoising sequence autoencoder on HealthKit data (sequences of step counts and heart rate measurements) in order to predict whether somebody is likely to have diabetes, high blood pressure, or a heart rhythm disorder based on wearable data. I've also seen similar ideas applied to EMR data (similar to word2vec). It's worth reading "Semi-Supervised Sequence Learning", where they use a non-denoising sequence autoencoder as a pretraining step, and compare a couple of different techniques: https://papers.nips.cc/paper/5949-semi-supervised-sequence-learning.pdf

Toward the end, you start thinking about introducing different types of grammatical errors, like subject-verb disagreement. I think that's a good way to think about it. In the limit, you might even have a neural network generate increasingly harder types of grammatical corruptions, with the goal of "fooling" the corrector network. As the corruptor network and corrector network compete with each other, you might end up with something like a generative adversarial network: https://arxiv.org/abs/1701.00160
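A minimal sketch of the word-level "noise" idea described above, assuming simple random word dropout as the corruption; the `corrupt` helper and the drop probability are illustrative, not the project's actual scheme:

```python
import random

def corrupt(tokens, drop_prob=0.1, rng=random):
    """Randomly drop words to build a noisy source for a denoising seq2seq.

    The clean sentence stays as the training target; the corrupted copy
    becomes the source, so the model learns to reconstruct the original.
    """
    noisy = [tok for tok in tokens if rng.random() > drop_prob]
    return noisy if noisy else tokens  # never emit an empty sentence

clean = "alex went to the kitchen to store the milk".split()
pair = (corrupt(clean), clean)  # (source, target) training example
print(pair)
```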
jrapdx3 · over 8 years ago
The "correcter" is a worthy effort, and it needs to start somewhere. It shows the magnitude of the task, considering that missing articles are not the most crucial grammatical issues in on-line discourse. The meaning of a phrase is usually comprehensible with or without the article, and native speakers can easily overlook this kind of error made by non-native speakers.

OTOH, more troublesome to readers are common errors such as misuse of "its" vs. "it's", "to" vs. "too", and "their", "there" and "they're". These mistakes are quite prevalent among native-speaking writers, and so are more ubiquitous than the missing-article problem.

The "correcter" didn't correct the latter class of errors. Understandably this would be a much harder goal to accomplish given the highly contextual nature of grammatically correct word choices.

It prompts a question about how well the data-driven approach can handle the problem. Obviously that's what the research is trying to answer. It sure seems to point to something fairly easy for a human to do that's near or at the limit of what we can get a computer to do.
danso · over 8 years ago
Tried out some of the classic garden path sentences [0], and of the 4 examples, it got all but one right:

Original: The complex houses married and single soldiers and their families.

Deep Text Corrector: The complex houses married and a single soldiers and their families.

OT: does anyone know of a more substantial list of garden path sentences that people use in testing NLP software?

[0] https://en.wikipedia.org/wiki/Garden_path_sentence
saycheese · over 8 years ago
Deep Proofreading Tool Comparisons:

http://www.deepgrammar.com/evaluation

https://blogs.nvidia.com/blog/2016/03/04/deep-learning-fix-grammar/
stephanheijl · over 8 years ago
Looks like a cool project; I would love to see this as a browser plugin of some sort. As for the corpus, I suspect that using articles from Wikipedia would be appropriate. Large articles especially are routinely checked and cleaned up. It has the added benefit of being available in multiple languages.

(https://en.wikipedia.org/wiki/Wikipedia:Database_download)

EDIT: I see this has already been suggested, along with a number of other sources, in another comment by daveytea.
camoby · over 8 years ago
Is the spelling of the name supposed to be ironic? ;)
YeGoblynQueenne · over 8 years ago
>> Thus far, these perturbations have been limited to:

  + the subtraction of articles (a, an, the)
  + the subtraction of the second part of a verb contraction (e.g. "'ve", "'ll", "'s", "'m")
  + the replacement of a few common homophones with one of their counterparts (e.g. replacing "their" with "there", "then" with "than")

Oooh, that's _very_ tricky what they're trying to do there.

"Perturbations" that cause grammatical sentences to become ungrammatical are _very_ hard to create, for the absolutely practical reason that the only way to know whether a sentence is ungrammatical is to check that a grammar rejects it. And, for English (and natural languages generally) we have no such (complete) grammars. In fact, that's the whole point of language modelling: everyone's trying to "model" (i.e. approximate, i.e. guess at) the structure of English (etc.)... because nobody has a complete grammar of it!

Dropping a few bits off sentences may sound like a reasonable alternative (an approximation of an ungrammaticalising perturbation) but, unfortunately, it's really, really not that simple.

For instance, take the removal of articles: consider the sentence "Give him the flowers". Drop the "the". Now you have "Give him flowers", which is perfectly correct and entirely plausible, conversational, everyday English.

In fact, dropping words is de rigueur in language modelling, either to generate skip-grams for training, or to clean up a corpus by removing "stop words" (uninformative words like the "the"s and "and"s) or, generally, cruft.

For this reason you'll notice that the NUCLE corpus used in the CoNLL-2014 error correction task mentioned in the OP is not auto-generated, and instead consists of student essays corrected by professors of English.

tl;dr: You can't rely on generating ungrammaticality unless you can generate grammaticality.
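A minimal sketch of the article-subtraction perturbation quoted above, illustrating the objection: dropping "the" from "Give him the flowers" yields "Give him flowers", which is still grammatical, so the generated "error" is not an error at all. The helper below is illustrative, not the project's code:

```python
import random

ARTICLES = {"a", "an", "the"}

def drop_one_article(tokens, rng=random):
    """Remove a single random article; return the tokens unchanged if
    the sentence contains no article."""
    positions = [i for i, tok in enumerate(tokens) if tok.lower() in ARTICLES]
    if not positions:
        return tokens
    i = rng.choice(positions)
    return tokens[:i] + tokens[i + 1:]

print(drop_one_article("give him the flowers".split()))
# -> ['give', 'him', 'flowers'], which is perfectly good English
```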
smoyer · over 8 years ago
'Kvothe went to the market'

Off-topic but I've been waiting for the third and final book in the trilogy for a long time ... I've come to the conclusion that Rothfuss can't find a way to tie all the plot threads together.

I'm also wondering if anyone else thinks Rothfuss looks like Longfellow in a lot of his publicity shots.
topynate · over 8 years ago
> Unfortunately, I am not aware of any publicly available dataset of (mostly) grammatically correct English.

How about books?
raverbashing · over 8 years ago
While the set of errors it can officially correct is limited, I tried a few phrases:

Didn't fix misuse of its: "The tool worked on it's own power"

"He should of gone yesterday" gets corrected to "the He should of gone yesterday"

"To who does this belong?" doesn't get corrected

"A Apple a day keeps the doctor away" doesn't change
zitterbewegung · over 8 years ago
This is a neat project. I think as a follow-up step he should compare it to Word's grammar checker.
ematvey · over 8 years ago
Nice work! I was playing with exactly this idea for some time. Potentially it could be way bigger than simple grammatical corrections.

My list of things to try, in addition to what you've already done:

- replacing named entities with metadata-annotated tokens;
- dropping random words, not just articles;
- replacing random words with rarer synonyms;
- annotating with POS tags from some external parser;
- running a syntax corrector before feeding sentences into the grammatical model.

I think this problem is easier than it appears on the surface. The generated deformation does not have to be a perfect replica of typical human errors; it just has to be sufficiently diverse.

Also, I think the seq2seq module is getting deprecated, as it doesn't do dynamic rollouts.
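A minimal sketch of the first item on the list above (replacing named entities with metadata-annotated tokens), assuming spaCy with its small English model is installed; the placeholder format is an arbitrary choice:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def mask_entities(text):
    """Replace each named-entity span with a token carrying its label,
    e.g. '_PERSON_' or '_GPE_', leaving the rest of the text intact."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        out.append(text[last:ent.start_char])
        out.append(f"_{ent.label_}_")
        last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(mask_entities("Alex flew to Paris on Monday."))
```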
WhitneyLand · over 8 years ago
Alex, nice work, this is exciting. I've been wanting to work on something similar because the quality of common grammar checkers (like MS Word) has made so little progress.

Have you considered combining your approach with a rules-based system? Some systems using only an elaborate set of rules for common mistakes have had pretty good performance. I wonder if the two approaches could be combined.

Btw, what is the highest-performing grammar checker you've found that's commonly available?
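A minimal sketch of what a small rules layer for common mistakes might look like, run before or after the neural corrector; the two rules are illustrative and deliberately naive, not any real product's rule set:

```python
import re

# Hand-written rules for a couple of frequent mistakes. A real rules engine
# would be much larger and more context-aware.
RULES = [
    # "should of" / "could of" / "would of" -> "... have"
    (re.compile(r"\b(should|could|would) of\b", re.IGNORECASE), r"\1 have"),
    # naive "a" -> "an" before a vowel letter (over-applies to e.g. "a university")
    (re.compile(r"\ba ([aeiou])", re.IGNORECASE), r"an \1"),
]

def apply_rules(sentence):
    for pattern, replacement in RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence

print(apply_rules("He should of gone yesterday"))          # -> He should have gone yesterday
print(apply_rules("A Apple a day keeps the doctor away"))  # -> an Apple a day keeps the doctor away
```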
hbornfree · over 8 years ago
This is terrific! Very coincidentally, we were thinking of implementing a sentence de-noiser using sequence-to-sequence models just this evening. I work in the NLP domain writing Machine Translation systems, but NLP parsers are only accurate for grammatically correct sentences, which is exactly why something like a deep text correcter is needed. Thank you for this. Will try this out this week and let you know how it goes.
UhUhUhUh · over 8 years ago
Re. intent, but with regard to spelling: I often wonder if there could be rules to correct errors due to keystrokes in the immediate vicinity of the intended letter (e.g. "keu" vs. "key"). It would check combinations of the surrounding letters, first oin (that's an unintended addition here) the horizontal axis. That happens to me all the time, probably because I'm not a good typist, but still.
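A minimal sketch of the keyboard-adjacency idea: generate candidate words by swapping each letter with its physical neighbours and keep candidates found in a dictionary. The adjacency map covers only a few keys and the word list is a stand-in; both are illustrative assumptions:

```python
# Partial QWERTY adjacency map; a full map would cover every key.
ADJACENT = {
    "u": "yihj",
    "y": "tuhg",
    "e": "wrsd",
    "o": "ipkl",
}

DICTIONARY = {"key", "keys", "the", "there"}  # stand-in for a real word list

def adjacency_candidates(word):
    """Yield dictionary words reachable by replacing one letter with a
    physically neighbouring key (e.g. 'keu' -> 'key')."""
    for i, ch in enumerate(word):
        for repl in ADJACENT.get(ch, ""):
            candidate = word[:i] + repl + word[i + 1:]
            if candidate in DICTIONARY:
                yield candidate

print(list(adjacency_candidates("keu")))  # -> ['key']
```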
OJFord · over 8 years ago
I can't make it work for anything other than the missing 'the' example.

For example:

> Do you know where I been

'corrects' to:

> Do you know where I 's been
mikeflynn · over 8 years ago
Interesting project, and I love the example they used on the demo page. (Go Cardinals!)
ashildr · over 8 years ago
And so it begins: http://www.goodreads.com/book/show/13184491-avogadro-corp
macawfish · over 8 years ago
"I gotta take shit"

->

"I gotta take the shit"

Sorry to be airing out my personal business and everything but... everybody poops!
grizzles · over 8 years ago
For training data, you could try ebook torrents, e.g. books with a Creative Commons license.
koliber · over 8 years ago
Would the works archived in Project Gutenberg be a good training corpus?
burnbabyburn · over 8 years ago
Isn't an n-gram Bayesian model sufficient here?

Also, isn't the test data linearly dependent on the training set here, creating a skewed performance measurement?
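A minimal sketch of the n-gram idea: score candidate corrections with a smoothed bigram model and keep the more probable one. The toy corpus, add-alpha smoothing, and vocabulary size below are all placeholders, not a claim about what the project uses:

```python
import math
from collections import Counter

# Toy bigram "language model"; a real one would be trained on a large corpus
# and smoothed properly (e.g. Kneser-Ney instead of add-alpha).
corpus = "alex went to the kitchen to store the milk in the fridge".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def score(sentence, alpha=1.0, vocab_size=10_000):
    """Average add-alpha-smoothed bigram log-probability per bigram.

    Normalising by the number of bigrams keeps candidates of different
    lengths comparable."""
    tokens = sentence.split()
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        logp += math.log((bigrams[(prev, cur)] + alpha) /
                         (unigrams[prev] + alpha * vocab_size))
    return logp / max(1, len(tokens) - 1)

# Pick whichever candidate "correction" the model finds more probable.
candidates = ["alex went to kitchen", "alex went to the kitchen"]
print(max(candidates, key=score))  # -> "alex went to the kitchen"
```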
guelo · over 8 years ago
What about using books, such as those from Project Gutenberg?
sigmonsays · over 8 years ago
but will it correct "lets eat grandma"
vacri · over 8 years ago
> "Kvothe went to market"

This is not a grammatically incorrect sentence; it depends on context. Products are taken to an abstract concept of "market", for example.