I created haiku_robot (<a href="http://www.reddit.com/user/haiku_robot" rel="nofollow">http://www.reddit.com/user/haiku_robot</a>) on reddit and found, from experience, that optimizing for accuracy wasn't very worthwhile. The cases where I got the syllable count wrong seemed to draw about the same distribution of upvotes as the ones where I got it right, and regional variations in pronunciation meant that I was accused of being wrong more often when I was right than when I was wrong.
This is pretty neat. I've been puttering on one on and off, but it's horribly broken so I haven't released it, so this one gets extra points for actually existing. :)<p>In case my half-done thoughts are useful to anyone looking to build something in this space:<p>My aim is/was to allow configurable matching, so you can match, e.g., "XxxXxx / XxxXxx1 / XxxXxx / XxxXxx1", meaning four consecutive lines of six syllables, where X is a stressed and x an unstressed syllable, and where the last syllables of the 2nd and 4th lines share the same phoneme, denoted "1", whereas there are no phonemic constraints on any other syllables (this allows a crude approach to rhyme).<p>I'm not entirely happy with cmudict because, since it works one word at a time, it can't really do much about stress, which can vary depending on the surrounding words. I've been using the output of <i>espeak -x</i> instead, which gives a phonetic rendering of an entire sentence, assigning both phonemes and stress. I'm not sure if it's genuinely an improvement though. Its poorly documented output surely isn't an improvement! And in particular it gives a normal prosaic reading of a sentence, which might be too constraining for poetry-finding, since poems often allow a bit of freedom in moving the stresses around.<p>The idea for scanning large amounts of text is to compile the configurable pattern into a regex that matches espeak -x output, so for example X gets mapped to a "match any stressed syllable" regex snippet. Alas, that's error-prone, especially since the espeak -x phoneme format is a bit quirky (e.g. no fixed length per syllable and no syllable markers, so you need some per-language rules to figure out which sequences of ASCII constitute what, and I haven't debugged those).
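To make the pattern-to-regex idea above a bit more concrete, here is a minimal sketch in Python. It does not parse real espeak -x output. It assumes the text has already been converted into a made-up token format, one verse line per text line, where every syllable is written as "S:phoneme;" (stressed) or "u:phoneme;" (unstressed). That format, the name compile_pattern, and the phoneme character class are all assumptions for illustration only.

```python
import re

# Hypothetical syllable encoding (NOT espeak's format):
#   "S:<phoneme>;" for a stressed syllable, "u:<phoneme>;" for an unstressed one.
STRESS_PREFIX = {"X": "S:", "x": "u:"}
PHONEME = r"[A-Za-z@]+"  # assumed alphabet for the toy phoneme strings


def compile_pattern(pattern: str) -> re.Pattern:
    """Compile a pattern such as 'XxxXxx / XxxXxx1' into a multi-line regex.

    A digit after a syllable labels its phoneme. The first use of a label
    becomes a named capture group; later uses become backreferences, so
    syllables carrying the same label must have identical phonemes.
    """
    seen_labels = set()
    line_regexes = []
    for line in pattern.split("/"):
        parts = []
        for match in re.finditer(r"([Xx])(\d?)", line.strip()):
            stress, label = match.groups()
            prefix = STRESS_PREFIX[stress]
            if not label:
                parts.append(prefix + PHONEME + ";")
            elif label in seen_labels:
                parts.append(prefix + f"(?P=r{label});")
            else:
                seen_labels.add(label)
                parts.append(prefix + f"(?P<r{label}>{PHONEME});")
        # Anchor each verse line to a whole line of the input text.
        line_regexes.append("^" + "".join(parts) + "$")
    return re.compile(r"\n".join(line_regexes), re.MULTILINE)


# Usage: four two-syllable lines where lines 2 and 4 must end alike.
rx = compile_pattern("Xx / Xx1 / Xx / Xx1")
rhyming = "S:a;u:b;\nS:c;u:d;\nS:e;u:f;\nS:g;u:d;\n"
non_rhyming = "S:a;u:b;\nS:c;u:d;\nS:e;u:f;\nS:g;u:z;\n"
print(rx.search(rhyming) is not None)
print(rx.search(non_rhyming) is not None)
```

The backreference is what gives the crude rhyme constraint. The hard part described above, turning real espeak -x output into well-delimited syllables, is exactly what this sketch sidesteps.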
Fantastic! This shows the possibilities of what can be created from the text in the Gutenberg archives. Assuming all the fiction ever created were available on your laptop (quite feasible now, except, of course, for the small matter of copyright), what new expressions could be derived?<p>On a different note, I read the about section of the blog and saw that the OP, in addition to this great stuff, is a beekeeping, hacking attorney who also spins fire. Amazing!
for placing every moment of<p>the labourer's time and that of<p>his family at the<p>disposal of the<p>capitalist for the purpose of<p>greater quantity of labour<p>In addition to a measure<p>of its extension<p>ie duration<p>labour now acquires a measure<p>-Karl Marx