Ask HN: What are some good open-source language/textual analysis tools?

5 points by CoreSet over 10 years ago

I'm looking to do linguistic/textual analysis on a large amount of text I've scraped for a research project, finding stats like frequently used words, associated topic clusters, and gender estimations.

I wrote the scraper myself, but for the language analysis it seems easier to find something open source and use it out of the box or slightly modified.

Anyone have any ideas/leads? Preference is for a script or process I can run from the command line to output the vitals.

4 comments

whitej125 over 10 years ago

If you are a Python person (it's a very popular language in the data science realm these days), your gateway drug to linguistic and textual analysis is going to be NLTK: http://www.nltk.org/

The free book and tutorials are great and you can get up and running pretty quickly.

NLTK's gentle learning curve is great for getting your head around NLP concepts. Once you start looking for more functionality or performance, you'll find yourself graduating to scikit-learn (http://scikit-learn.org/stable/).

In the Java world, I think Mahout is/was popular, though there is quite a bit more setup to get through in order to get it up and running.
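For the "frequently used words" part of the question, here is a minimal sketch of a command-line script that uses only the Python standard library, so it runs before anything is installed. The function name `top_words`, the tiny stopword set, and the file name in the usage comment are all made up for illustration. As far as I know NLTK's `FreqDist` is built on the same `collections.Counter`, so the idea should carry over once you move to NLTK.

```python
import re
import sys
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real run would want a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def top_words(text, n=10):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    # Lowercase, then keep runs of letters with an optional internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed usage: python top_words.py scraped.txt
    with open(sys.argv[1], encoding="utf-8") as f:
        for word, count in top_words(f.read()):
            print(f"{word}\t{count}")
```

Topic clusters and gender estimation need real models, so this only covers the simplest of the stats asked for.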

manidoraisamy over 10 years ago

Stanford NLP is pretty good if you are on Java: http://nlp.stanford.edu/software/corenlp.shtml

You might also want to look at word2vec (implemented in most of the popular languages): https://code.google.com/p/word2vec/

wallflower over 10 years ago

This seems to have some good starting pointers: http://blog.datadive.net/which-topics-get-the-upvote-on-hacker-news/

biomimic over 10 years ago

This text summarizer will be open sourced soon: http://genopharmix.com/TuataraSum