One problem I see in the article above is that the result is evaluated only by eyeballing, which <i>is</i> actually a perfectly legitimate evaluation method; the human language faculty is, after all, the only process we know of that can correctly recognise and generate natural language expressions (and other, automatic methods like BLEU scores often just automate the eyeballing).<p>However, if you're working with statistical language modelling, you should probably also check the statistical properties of the resulting model: for example, measure perplexity or cross-entropy on held-out text (there's a small sketch at the end of this comment). That gives you a more concrete, and faster, way to evaluate a model than generating some text and staring at it, trying to figure out where it came from. I'd suggest the author give that a try; they seem to have dipped a toe into statistical language modelling but to be unwilling to go all the way with more advanced approaches (e.g. the cited paper, by Kassarnig, uses "n-grams, Justeson & Katz POS tag filter, recurrent neural networks, and latent Dirichlet allocation", but the article's author doesn't seem to have tried any of those on their own, let alone in combination, other than n-grams). The more advanced material can be daunting, but it's not actually such a huge leap from Markov chains, and the results can <i>sometimes</i> be worth the effort.<p>. . .<p>Btw, it sounds to me like the approach taken in the "N-Grams, take 2" section basically boils down to a PCFG (a probabilistic context-free grammar). The probability of the next token seems to be conditioned on entire preceding <i>strings</i> of tokens in a sentence, which is what a PCFG does (which is to say, PCFGs are not Markov). With an N high enough, that will, indeed, copy your text verbatim :)<p>. . .<p>The author should keep in mind that no matter what you do, Markov chains are always going to either look completely incoherent or produce exact copies of their training text.<p>In fact, that's a bit of a problem with most statistical language modelling techniques. Even state-of-the-art systems that can produce very grammatical text fail badly when it comes to producing <i>coherent</i> text (e.g. a recipe with a perfectly reasonable structure where the listed ingredients never appear in the preparation instructions). And that's something no statistical measure of fit can capture, unfortunately, so you're left again with eyeballing the end results and hoping for the best.<p>Edit: sorry, I'm babbling a bit. The way to use statistical evaluation metrics like perplexity is to first evaluate your generated text automatically, then eyeball it to see how well the metric matches your own "sense" of the text's grammaticality. If you find that low perplexity goes with more grammatical output, you favour models with low perplexity, and so on.
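<p>To make the perplexity suggestion concrete, here's roughly the kind of thing I mean, as a minimal sketch: an add-one-smoothed bigram (first-order Markov) model in Python, scored on held-out text. The toy strings are just placeholders, not anything from the article; real training and held-out corpora would slot in instead.<p><pre><code>import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_bigrams(tokens):
    # Unigram and bigram counts over the training tokens.
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word, unigrams, bigrams, vocab_size):
    # Laplace (add-one) smoothing so unseen bigrams don't get zero probability.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(tokens, unigrams, bigrams, vocab_size):
    # Perplexity = 2 ** (average negative log2 probability per predicted token).
    log_prob = sum(
        math.log2(bigram_prob(prev, word, unigrams, bigrams, vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )
    return 2 ** (-log_prob / (len(tokens) - 1))  # lower = better statistical fit

train_tokens = tokenize("the house met at noon the house adjourned at night")
heldout_tokens = tokenize("the house met at night")
unigrams, bigrams = train_bigrams(train_tokens)
print(perplexity(heldout_tokens, unigrams, bigrams, len(unigrams)))
</code></pre><p>The same loop works for higher-order n-grams (you just condition on longer prefixes); the point is to compare models on the same held-out text, then eyeball their output to see whether lower perplexity actually lines up with more grammatical results.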