I'm looking to do linguistic/textual analysis on a large amount of text I've scraped for a research project, computing stats such as frequently used words, associated topic clusters, and gender estimates.<p>I wrote the scraper myself, but for the language analysis it seems easier to find something open source and use it out of the box or slightly modified.<p>Does anyone have any ideas or leads? My preference is for a script or process I can run from the command line to output the vitals.
If you are a Python person (a very popular language in the data science realm these days), your gateway drug to linguistic and textual analysis is going to be NLTK.<p><a href="http://www.nltk.org/" rel="nofollow">http://www.nltk.org/</a><p>The free book and tutorials are great, and you can get up and running pretty quickly.<p>NLTK's gentle learning curve is great for getting your head around NLP concepts. Once you start looking for more functionality or performance, you'll find yourself graduating to scikit-learn (<a href="http://scikit-learn.org/stable/" rel="nofollow">http://scikit-learn.org/stable/</a>).<p>In the Java world, I think Mahout is/was popular, though there is quite a bit more setup to get through in order to get it up and running.
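For the "frequently used words" part specifically, here is a minimal sketch using only the Python standard library, so you can see the shape of the task before installing anything. The function name `top_words`, the tiny stopword set, and the regex tokenizer are all assumptions made for illustration; NLTK ships proper tokenizers and a much fuller stopword list that you would likely swap in.

```python
import re
import sys
from collections import Counter

# Tiny illustrative stopword set (an assumption, not NLTK's list).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in",
    "is", "it", "that", "for", "on", "with",
}


def top_words(text, n=10):
    """Return the n most common non-stopword tokens as (word, count) pairs."""
    # Lowercase, then keep runs of letters with an optional inner apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


# Only runs when a file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as f:
        for word, count in top_words(f.read()):
            print(f"{count}\t{word}")
```

Saved under a hypothetical name like `top_words.py`, it would be run as `python top_words.py scraped.txt` and print one tab-separated count and word per line. Topic clusters and gender estimates need real models, which is where the libraries above come in.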
Stanford CoreNLP is pretty good if you are on Java: <a href="http://nlp.stanford.edu/software/corenlp.shtml" rel="nofollow">http://nlp.stanford.edu/software/corenlp.shtml</a><p>You might also want to look at word2vec (implemented in most of the popular languages): <a href="https://code.google.com/p/word2vec/" rel="nofollow">https://code.google.com/p/word2vec/</a>
This seems to have some good starting pointers:<p><a href="http://blog.datadive.net/which-topics-get-the-upvote-on-hacker-news/" rel="nofollow">http://blog.datadive.net/which-topics-get-the-upvote-on-hack...</a>
This text summarizer will be open sourced soon: <a href="http://genopharmix.com/TuataraSum" rel="nofollow">http://genopharmix.com/TuataraSum</a>