I get that the point is to be an introduction to the libraries and whatnot, but was I the only one who immediately thought of just using Counter?<p><pre><code>from collections import Counter
import re
# \w+ rather than \w*, which also matches empty strings
[word for word, count in Counter(re.findall(r'\w+', text.lower())).items() if count == 1]</code></pre>
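<p>For anyone who wants to run that, here is a minimal self-contained sketch of the same Counter idea. The function name `hapaxes` and the sample sentence are made up for illustration, and "word" here just means a run of `\w` characters, which is a much cruder tokenizer than what NLTK gives you.

```python
import re
from collections import Counter


def hapaxes(text):
    """Return the words that occur exactly once in text, in first-seen order."""
    # Lowercase first so "The" and "the" are counted as the same word.
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    # Counter keeps insertion order (Python 3.7+), so the result follows the text.
    return [word for word, count in counts.items() if count == 1]


sample = "The quick brown fox jumps over the lazy dog. The dog barks."
print(hapaxes(sample))
```

Everything is held in memory, so this is fine for a single document but will not scale to a very large corpus.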
For anyone interested in more good beginner resources, I really enjoyed this YouTube playlist on Python NLTK <a href="https://www.youtube.com/watch?v=OGxgnH8y2NM&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v" rel="nofollow">https://www.youtube.com/watch?v=OGxgnH8y2NM&list=PLQVvvaa0Qu...</a><p>Edit: I accidentally linked to another good playlist, but here's the first video of the NLTK list from the same user <a href="https://www.youtube.com/watch?v=FLZvOKSCkxY" rel="nofollow">https://www.youtube.com/watch?v=FLZvOKSCkxY</a>
I counted word n-grams up to length 6 in a corpus of 6 billion words with Madoka, a Count-Min sketch library.<p><a href="https://pypi.python.org/pypi/madoka" rel="nofollow">https://pypi.python.org/pypi/madoka</a>
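<p>For readers who haven't met a Count-Min sketch before, here is a rough pure-Python sketch of the idea applied to n-gram counting. This is not Madoka's API and says nothing about how Madoka works internally; the class `CountMinSketch`, the helper `ngrams`, and the `width`/`depth` defaults are all assumptions picked for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter in fixed memory; estimates never undercount."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        # depth rows of width counters, one row per hash function
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        data = item.encode("utf-8")
        for row in range(self.depth):
            # Salting with the row number gives a different hash per row.
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a space-joined string."""
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = "to be or not to be".split()
sketch = CountMinSketch()
for n in range(1, 4):
    for gram in ngrams(tokens, n):
        sketch.add(gram)

print(sketch.estimate("to be"))  # true count is 2; the estimate is never lower
```

The trade-off is that memory stays at width × depth counters no matter how many distinct n-grams show up, in exchange for occasionally overestimating a count when hashes collide.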