For those who didn't click through, the tweets they show for their POS tagger are great:<p><pre><code> ikr smh he asked fir yo last name so he can add u on fb lololol
</code></pre>
and<p><pre><code> :o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| ") :> ....
</code></pre>
I'd love to see what Word2Vec did for that first one.
CMU does some of the most interesting work around with Twitter data. Some of the same people responsible for this worked on predicting NFL games with Twitter data.[0]<p>[0]<a href="https://www.cs.cmu.edu/~nasmith/papers/sinha+dyer+gimpel+smith.mlsa13.pdf" rel="nofollow">https://www.cs.cmu.edu/~nasmith/papers/sinha+dyer+gimpel+smi...</a>
Tweet NLP from CMU.<p><pre><code> We provide a tokenizer, a part-of-speech tagger,
hierarchical word clusters, and a dependency parser for
tweets, along with annotated corpora and web-based
annotation tools.</code></pre>
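To give a feel for why tweets need their own tokenizer, here is a rough Python sketch. It is not the CMU tokenizer and does not reproduce its rules: the `EMOTICON` pattern and the `tokenize` and `naive_tokenize` names are made up for illustration, and the pattern covers only a handful of the emoticons in the example tweet.

```python
import re

# Hypothetical, deliberately small emoticon pattern (an assumption for
# illustration, NOT the rules used by the CMU Tweet NLP tokenizer).
# First branch: optional brow, eyes, optional nose, mouth   e.g. :)  :-)  >:(  8D
# Second branch: the mirrored form                          e.g. (:
# The lookarounds stop it from firing inside words such as "8pm".
EMOTICON = (
    r"(?<!\w)(?:"
    r"[<>]?[:;=8][-o*']?[)\](\[dDpPoO/\\|]"
    r"|"
    r"[)\](\[dDpP/\\|][-o*']?[:;=8][<>]?"
    r")(?!\w)"
)

# Try emoticons first, then @mentions, #hashtags, words (with an optional
# apostrophe part), and finally any single non-space character.
TOKEN = re.compile(rf"{EMOTICON}|@\w+|#\w+|\w+(?:'\w+)?|\S")


def tokenize(text: str) -> list[str]:
    """Split text into tokens, keeping the emoticons we recognize whole."""
    return TOKEN.findall(text)


def naive_tokenize(text: str) -> list[str]:
    """Baseline: words or single punctuation marks, no emoticon handling."""
    return re.findall(r"\w+|\S", text)


if __name__ == "__main__":
    sample = "ikr smh :) >:( (: 8D fb"
    print(naive_tokenize(sample))
    print(tokenize(sample))
```

The baseline breaks `:)` into two punctuation tokens, while the emoticon-aware version keeps it as one. Anything outside the small pattern (`XD`, `-__-`, `o.O`, and so on) still gets split up or treated as a plain word, which is a decent hint at how much work a real tweet tokenizer has to do.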
Very cool. I did some authorship attribution using a Twitter corpus from MSR during grad school. The length limit on tweets pushes users toward a different writing style than they use in other document types, which makes for interesting NLP problems.