>> Thus far, these perturbations have been limited to:<p><pre><code> + the subtraction of articles (a, an, the)
+ the subtraction of the second part of a verb contraction (e.g. “‘ve”, “‘ll”, “‘s”, “‘m”)
+ the replacement of a few common homophones with one of their counterparts (e.g. replacing “their” with “there”, “then” with “than”)
</code></pre>
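To make the problem concrete, here's a rough sketch of what perturbations like those might look like. This is a hypothetical re-implementation, not the authors' actual code, and the rule lists are illustrative subsets:

```python
import re

# Illustrative rule sets -- not the actual ones from the paper.
ARTICLES = {"a", "an", "the"}
CONTRACTIONS = re.compile(r"'(ve|ll|s|m|re|d)\b")  # straight-quote variants
HOMOPHONES = {"their": "there", "there": "their", "then": "than", "than": "then"}

def drop_articles(sentence):
    """Subtract articles (a, an, the)."""
    return " ".join(w for w in sentence.split() if w.lower() not in ARTICLES)

def drop_contractions(sentence):
    """Subtract the second part of a verb contraction ('ve, 'll, 's, 'm)."""
    return CONTRACTIONS.sub("", sentence)

def swap_homophones(sentence):
    """Replace common homophones with a counterpart."""
    return " ".join(HOMOPHONES.get(w.lower(), w) for w in sentence.split())

# The catch: nothing guarantees the output is ungrammatical.
print(drop_articles("Give him the flowers"))  # -> Give him flowers (still fine!)
```

Note that the last line is exactly the failure mode discussed below: the "perturbed" sentence is perfectly good English.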
Oooh, what they're trying to do there is _very_ tricky.<p>"Perturbations" that cause grammatical sentences to become ungrammatical are
_very_ hard to create, for the absolutely practical reason that the only way
to know whether a sentence is ungrammatical is to check that a grammar rejects
it. And for English (and natural languages generally) we have no such
complete grammars. In fact, that's the whole point of language modelling:
everyone's trying to "model" (i.e. approximate, i.e. guess at) the structure
of English (etc)... because nobody has a complete grammar of it!<p>Dropping a few bits off sentences may sound like a reasonable alternative (an
approximation of an ungrammaticalising perturbation) but, unfortunately, it's
really, really not that simple.<p>For instance, take the removal of articles. Consider the sentence "Give him
the flowers". Drop the "the". Now you have "Give him flowers". Which is
perfectly correct and entirely plausible, conversational, everyday English.<p>In fact, dropping words is de rigueur in language modelling, either to generate
skip-grams for training, or to clean up a corpus by removing "stop words"
(uninformative words like "the" and "and") or, generally, cruft.<p>For this reason you'll notice that the NUCLE corpus used in the CoNLL-2014
error correction task mentioned in the OP is not auto-generated, and instead consists of student
essays corrected by professors of English.<p>tl;dr: You can't reliably generate ungrammaticality unless you can generate
grammaticality.
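For comparison, the benign, routine kind of word-dropping mentioned above (stop-word removal during corpus cleanup) looks something like this. The stop list here is a tiny illustrative subset, not any standard list:

```python
# Stop-word removal, as routinely done when cleaning a corpus.
# Real stop lists (e.g. in NLP toolkits) are much longer than this.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "is"}

def remove_stop_words(tokens):
    """Drop uninformative words, keeping the content-bearing ones."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the cat and the dog".split()))  # -> ['cat', 'dog']
```

Structurally it's the same operation as the article-dropping perturbation, which is the point: deleting words is what you do to make text *cleaner*, not to make it ungrammatical.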