One problem I see in the article above is that the result is evaluated only by eyeballing, which <i>is</i> actually a perfectly legitimate evaluation method; the human language faculty is, after all, the only process we know of that can correctly recognise and generate natural language expressions (and other, automatic methods like BLEU scores often just automate the eyeballing).<p>However, if you're working with statistical language modelling, you should probably also check the statistical properties of the resulting model: for example, measure perplexity or cross-entropy on held-out text (there's a small sketch at the end of this comment). That gives you a more concrete, and faster, way to evaluate a model than generating some text and staring at it, trying to figure out where it came from. I'd suggest the author give that a try; they seem to have dipped a toe into statistical language modelling but to be unwilling to go all the way with more advanced approaches (e.g. the cited paper, by Kassarnig, uses "n-grams, Justeson & Katz POS tag filter, recurrent neural networks, and latent Dirichlet allocation", but the article's author doesn't seem to have tried any of those on their own, let alone in combination, other than n-grams). The more advanced material can be daunting, but it's not actually such a huge leap from Markov chains, and the results can <i>sometimes</i> be worth the effort.<p>. . .<p>Btw, it sounds to me like the approach taken in the "N-Grams, take 2" section basically boils down to a PCFG (a probabilistic context-free grammar). The probability of the next token seems to be conditioned on entire preceding <i>strings</i> of tokens in a sentence, which is what a PCFG does (which is to say, PCFGs are not Markov). With an N high enough, that will, indeed, copy your text verbatim :)<p>. . .<p>The author should keep in mind that no matter what you do, Markov chains are always going to either look completely incoherent or produce exact copies of their training text.<p>In fact, that's a bit of a problem with most statistical language modelling techniques. Even state-of-the-art systems that can produce very grammatical text fail badly when it comes to producing <i>coherent</i> text (e.g. a recipe with a perfectly reasonable structure where the listed ingredients never appear in the preparation instructions). And that's something no statistical measure of fit can capture, unfortunately, so you're left again with eyeballing the end results and hoping for the best.<p>Edit: sorry, I'm babbling a bit. The way to use statistical evaluation metrics like perplexity is to first evaluate your generated text automatically, then eyeball it to see how well the metric matches your own "sense" of the text's grammaticality. If you find that low perplexity goes with more grammatical output, you favour models with low perplexity, and so on.
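<p>To make the perplexity suggestion concrete, here's roughly the kind of thing I mean, as a minimal sketch: an add-one-smoothed bigram (first-order Markov) model in Python, scored on held-out text. The toy strings are just placeholders, not anything from the article; real training and held-out corpora would slot in instead.<p><pre><code>import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_bigrams(tokens):
    # Unigram and bigram counts over the training tokens.
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word, unigrams, bigrams, vocab_size):
    # Laplace (add-one) smoothing so unseen bigrams don't get zero probability.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(tokens, unigrams, bigrams, vocab_size):
    # Perplexity = 2 ** (average negative log2 probability per predicted token).
    log_prob = sum(
        math.log2(bigram_prob(prev, word, unigrams, bigrams, vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )
    return 2 ** (-log_prob / (len(tokens) - 1))  # lower = better statistical fit

train_tokens = tokenize("the house met at noon the house adjourned at night")
heldout_tokens = tokenize("the house met at night")
unigrams, bigrams = train_bigrams(train_tokens)
print(perplexity(heldout_tokens, unigrams, bigrams, len(unigrams)))
</code></pre><p>The same loop works for higher-order n-grams (you just condition on longer prefixes); the point is to compare models on the same held-out text, then eyeball their output to see whether lower perplexity actually lines up with more grammatical results.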