Ask HN: What are the best tools for analyzing large bodies of text?

83 points by CoreSet · almost 10 years ago
I'm a researcher in the social sciences, working on a project that requires me to scrape a large amount of text and then use NLP to determine things like sentiment analysis, LSM compatibility, and other linguistic metrics for different subsections of that content.

The issue: after weeks of work, I've scraped all this information (a few GB's worth) and begun to analyze it using a mixture of Node, Python, and bash scripts. In order to generate all of the necessary permutations of this data (looking at Groups A, B, and C together, A & C, A & B, etc.), I've generated an unwieldy number of text files (the script produced > 50 GB before filling up my pitiful MBP hard drive), which I understand is no longer sustainable.

The easiest way forward seems to be loading all of this into a database I can query to analyze different permutations of populations. I don't have much experience with SQL, but it seems to fit here.

So how do I put all these .txt files into a SQL or NoSQL database? Are there any tools I could use to visualize this data (IntelliJ, my editor, keeps breaking)? And where should I do all this work? I'm thinking either an external hard drive, or a VPS I can just tunnel into.

Thanks in advance for your advice, HN!

33 comments

drallison · almost 10 years ago
It seems to me that you are approaching this from the wrong direction. Given that you have a large body of text, what is it you want to learn about/from that text? Collecting and applying random tools and making measurements without some inkling of what you want or expect to discover makes no sense. Tell us more about the provenance of your corpus and what sort of information you want to derive from the data.
Comment #9736579 not loaded
Comment #9736986 not loaded
Comment #9738413 not loaded
Comment #9737103 not loaded

rasengan0 · almost 10 years ago
> a project that requires me to scrape a large amount of text and then use NLP to determine things like sentiment analysis, LSM compatibility, and other linguistic metrics for different subsections of that content.

I ran into a similar project and found these helpful for working with the unstructured data: https://textblob.readthedocs.org/en/dev/ and https://radimrehurek.com/gensim/

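For a sense of what the TextBlob route looks like in practice, here is a minimal sketch (assuming `pip install textblob` and that its corpora have been downloaded; the sample sentence is made up):

```python
from textblob import TextBlob

text = "I really enjoyed the hike, though the weather could have been better."
blob = TextBlob(text)

# Polarity is in [-1.0, 1.0], subjectivity in [0.0, 1.0].
print(blob.sentiment.polarity, blob.sentiment.subjectivity)

# Per-sentence sentiment comes for free, which is handy for longer posts.
for sentence in blob.sentences:
    print(sentence, sentence.sentiment.polarity)
```
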
cmarciniak · almost 10 years ago
More information would be helpful. In terms of a data store where you can easily query text, I would recommend Elasticsearch. Kibana is a dashboard built on Elasticsearch for performing analytics and visualization on your texts. Elasticsearch also has a RESTful API, which would play nicely with your Python scripts, or any scripting language for that matter. I would also recommend the Python package gensim for your NLP.
Comment #9737709 not loaded

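To make the Elasticsearch suggestion concrete, here is a rough sketch using the official Python client (`pip install elasticsearch`); the index name, field names, and query are illustrative assumptions, and keyword arguments vary somewhat between client versions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a node running on localhost:9200

# Index one document per scraped text file.
es.index(index="posts", body={
    "city": "chicago",
    "group": "A",
    "body": "text of one scraped post...",
})

# Full-text search restricted to one subpopulation; Kibana can chart
# the same index without any extra plumbing.
result = es.search(index="posts", body={
    "query": {
        "bool": {
            "must": [{"match": {"body": "romantic"}}],
            "filter": [{"term": {"group": "A"}}],
        }
    }
})
print(result["hits"]["total"])
```
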
lsiebert · almost 10 years ago
Which social science?

You shouldn't be generating all the combined text in advance and then processing it. You should be generating the text dynamically in memory, so you basically only have to worry about the memory for one text file at a time (see the sketch after this comment).

As for visualizations, R and ggplot2 may work (R can handle text and data munging, as well as sentiment analysis, etc.). It may be worth learning as a social scientist. ggplot2 also has a Python port.

That said, you are probably using NLTK, right? There are some tools in nltk.draw. There is probably also a users' mailing list for whatever package or tool you are using; consider asking there.
Comment #9736045 not loaded
Comment #9737096 not loaded
Comment #9736491 not loaded

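A minimal sketch of that "generate combinations in memory" idea, assuming the scraped files are organized into per-group directories (the layout here is hypothetical):

```python
import glob
from itertools import combinations

groups = {
    "A": glob.glob("data/group_a/*.txt"),
    "B": glob.glob("data/group_b/*.txt"),
    "C": glob.glob("data/group_c/*.txt"),
}

def texts_for(group_names):
    """Yield one document at a time for the requested group combination."""
    for name in group_names:
        for path in groups[name]:
            with open(path, encoding="utf-8") as f:
                yield f.read()

# Iterate over every pairwise grouping without writing anything to disk.
for combo in combinations(groups, 2):   # ("A", "B"), ("A", "C"), ("B", "C")
    n_docs = sum(1 for _ in texts_for(combo))
    print(combo, n_docs)
```
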
nutate · almost 10 years ago
Right now the fastest alternative to NLTK is spaCy (https://honnibal.github.io/spaCy/), definitely worth a look. I don't know what you're trying to do with the permutations part, but it seems like you could generate those on the fly through some reproducible algorithm (such that some integer seed describes the ordering in a reproducible way) and then just keep track of the seeds, not the permuted data.

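A small illustration of the "keep the seed, not the permuted data" idea; the document list is a stand-in:

```python
import random

def permuted(items, seed):
    """Return a shuffled copy whose order is fully determined by the seed."""
    rng = random.Random(seed)   # independent generator, doesn't touch global state
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled

docs = ["doc1", "doc2", "doc3", "doc4"]
print(permuted(docs, seed=42))
print(permuted(docs, seed=42))  # identical ordering, so only the seed needs storing
```
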
mark_l_watson · almost 10 years ago
One approach is to put the text files in Amazon S3 and write MapReduce jobs that you can run with Elastic MapReduce. I did this a number of years ago for a customer project, and it was inexpensive and a nice platform to work with. Microsoft, Google, and Amazon all have data warehousing products you can try if you don't want to write MapReduce jobs.

That said, if you are only processing 2 GB of text, you can often do that in memory on your laptop. This is especially true if you are doing NLP on individual sentences or paragraphs.

ChuckMcM · almost 10 years ago
Well, if you're willing to relocate to the Bay Area I could set you up in an office with a variety of tools to analyze and classify text and do all sorts of analysis on it; I'd even pay you :-) (yes, I've got a couple of job openings where this pretty much describes the job).

That said, converting large documents into data sets is examined in a number of papers, and you may find yourself getting more traction by splitting the problem up that way: step 1) pull the document apart into a more useful form, then step 2) do analysis on those parts. The two steps are interrelated, of course, and some forms of documents lend themselves to disassembly better than others (scientific papers, for example, are easy to unpack; random blog posts, less so).

As for "where" to do it, the ideal place is a NoSQL cluster. This is what we've done at Blekko for years (and continue to do post-acquisition): put the documents we crawl from the Internet into a giant NoSQL database, then run jobs that execute in parallel across all of those servers to analyze the documents (traditionally to build a search index, but other modalities are interesting too).

koopuluri · almost 10 years ago
What tools exactly are you using in Node and Python? Python has a nice data analysis library, pandas (http://pandas.pydata.org/), which would help with your task of generating multiple permutations of your data. Check out plot.ly to visualize the data (it integrates well with a pandas pipeline, in my experience). It would also help if you mentioned exactly what kind of visualizations you're looking to create from the data.

With regards to your issue of scale, this might help: http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas

I had similar issues when doing research in computer science, and I feel a lot of researchers working with data have this headache of organizing, visualizing, and scaling their infrastructure, along with versioning data and coupling their data with code. Adding more collaborators to this workflow would also be very time consuming...

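The Stack Overflow thread linked above largely boils down to streaming the data in chunks; a sketch of that pattern, with a hypothetical file and column name:

```python
import pandas as pd

counts = {}
# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the whole file at once.
for chunk in pd.read_csv("posts.csv", chunksize=100000):
    for group, n in chunk["group"].value_counts().items():
        counts[group] = counts.get(group, 0) + n

print(counts)
```
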
ashley · almost 10 years ago
>> I'm thinking now either an external hard drive, or on a VPS I can just tunnel into.

Consider setting up an Elasticsearch cluster somewhere, such as on AWS, which has plugins for Elasticsearch. Once you've indexed your data with ES, queries are pretty easy (JSON-based). This would also solve your other problem with data visualization: Elasticsearch has an analytics tool called Kibana. Pretty useful, and it doesn't require too much effort to set up or use. I'm using this setup for a sentiment analysis project myself.

You didn't mention the libraries in your NLP pipeline (guessing NLTK because of the Python?), but if you're doing LSM compatibility, I'm guessing you might be interested in clustering or topic-modelling algorithms and such... Mahout integrates easily with Elasticsearch.
Comment #9738041 not loaded

pvaldes · almost 10 years ago
Don't know if this is what you need or not, but Common Lisp has a package called 'cl-sentiment', specifically aimed at doing sentiment analysis on text: https://github.com/RobBlackwell/cl-sentiment

Other packages you could find useful are cl-mongo (for MongoDB NoSQL databases), cl-mysql, postmodern (PostgreSQL), and cl-postgres (PostgreSQL).

For Perl you also have Rate_Sentiment: http://search.cpan.org/~prath/WebService-GoogleHack-0.15/GoogleHack/Examples/Rate_Sentiment.pl

skadamat · almost 10 years ago
The first immediate thing I would recommend is moving all of your files into AWS S3: http://aws.amazon.com/s3/

Storage is super cheap, and you can get rid of the clutter on your laptop. I wouldn't recommend moving to a database yet, especially if you don't have any experience working with them. S3 has great connector libraries and good integrations with things like Spark, Hadoop, and other 'big data' analysis tools. I would start down that path and see which tools might be best for analyzing text files from S3.
Comment #9735961 not loaded

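A minimal sketch of that S3 step with boto3 (`pip install boto3`); the bucket name is hypothetical, and AWS credentials are assumed to be configured in the environment:

```python
import glob
import os

import boto3

s3 = boto3.client("s3")
BUCKET = "my-text-corpus"   # hypothetical bucket name

for path in glob.glob("scraped/*.txt"):
    key = os.path.basename(path)
    s3.upload_file(path, BUCKET, key)   # one S3 object per text file
    print("uploaded", key)
```
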
CoreSet · almost 10 years ago
EDIT:

I'm astounded by the number and quality of responses for appropriate tools. Thank you HN! To shed a little more light on the project:

I'm compiling research for a sociology / gender studies project that takes a national snapshot of romantic posts from classifieds sites / sources across the country, and then uses that data to try to draw meaningful insights about how different races, classes, and genders of Americans engage romantically online.

I've already run some basic text-processing algorithms (tf-idf, certain term frequency lists, etc.) on smaller txt files that represent the content for a few major U.S. metros and discovered some surprises that I think warrant a larger follow-up. So I have a few threads I want to investigate already, but I also don't want to be blind to interesting comparisons that can be drawn between data sets now that I have more information (that's why I'm asking for a bit of a grab-bag of text-processing abilities).

My problem is that the techniques from the first phase (analyzing a few metros) didn't scale to the larger data set: the entire data set is only 2 GB of text, but it started maxing out my memory as I recopied the text files over and over again into different groupings. Starting with a datastore from the beginning would also have worked, but it just wasn't necessary at the start of the project.

My current setup:
- Python's Beautiful Soup + CasperJS for scraping (which is done)
- Node, relying primarily on the excellent NLP package "natural", for analysis
- Bash to tie things together
- My personal MBP as the environment

SO, given the advice expressed in the thread (and despite my love of Shiny New Things), a combination of shell scripts and awk (a command-line language specifically for structured text files, which I had heard about before but thought was a networking tool) will probably work best, backed up by a 1 TB or similar external drive, which I could use anyway (and would be more secure). I have the huge luxury that this is a one-time, research-oriented project, and not something I need to worry about being performant.

I will of course look into a lot of the solutions provided here regardless, as something (especially along the visualization angle) could prove more useful, and it's all fascinating to me.

Thanks again HN for all of your help.
Comment #9759231 not loaded

bkin · almost 10 years ago
No answer to your question per se, but Software Engineering Radio has a nice episode about working with and extracting knowledge from larger bodies of text: http://www.se-radio.net/2014/11/episode-214-grant-ingersoll-on-his-book-taming-text/

machinelearning · almost 10 years ago
We're building a database specifically to solve this problem. We're almost production ready and will be going open source in a couple of months. Send us an email at textedb@gmail.com if you'd like to try it out. http://textedb.com/
Comment #9737258 not loaded

quizotic · almost 10 years ago
There are commercial products that do a decent job of what you want. I have experience with Oracle Endeca, and have heard that Qlik is even better. Both have easy ways to load and visualize data.

There are also research frameworks that are quite good. One is Factorie from UMass Amherst (factorie.cs.umass.edu), which supports latent Dirichlet allocation and lots more. Stanford also has wonderful tools.

Yes, you can dump text into SQL, and Postgres has some text analytics. But my guess is that you'll soon want capabilities that relational databases don't have. Mongo has some support for text, but not nearly as much as Postgres. I think both SQL and NoSQL are currently round pegs in square holes for deep text analytics; they're barely OK for search.

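For the "dump text into SQL" option, a rough sketch of Postgres full-text search from Python via psycopg2; the table name, columns, and connection string are made up for illustration:

```python
import psycopg2

conn = psycopg2.connect("dbname=corpus")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS posts (
        id   serial PRIMARY KEY,
        city text,
        body text
    )
""")
cur.execute("INSERT INTO posts (city, body) VALUES (%s, %s)",
            ("chicago", "text of one scraped post..."))

# Built-in full-text search: match posts containing the query terms.
cur.execute("""
    SELECT id, city
    FROM posts
    WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)
""", ("romantic dinner",))
print(cur.fetchall())

conn.commit()
conn.close()
```
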
hudibras · almost 10 years ago
Here's a good place to start: http://www.matthewjockers.net/text-analysis-with-r-for-students-of-literature/#comment-39991

chubot · almost 10 years ago
This sounds like an algorithmic issue. How many permutations are you generating? Are you sure you can scale it with different software tools or hardware, or is there an inherent exponential blowup?

Are you familiar with big-O / computational complexity? (I ask since you say your background is in the social sciences.)

A few GB of input data is generally easy to work with on a single machine using Python and bash. If you need big intermediate data, you can brute-force it with distributed systems, hardware, C++, etc., but that can be time consuming, depending on the application.

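A quick back-of-the-envelope check on the blowup being hinted at: the number of non-empty group combinations grows as 2^n - 1, so materializing each one as files on disk gets expensive fast.

```python
from itertools import combinations

groups = ["A", "B", "C", "D", "E"]
subsets = [c for r in range(1, len(groups) + 1)
           for c in combinations(groups, r)]

# 31 non-empty combinations for 5 groups; doubling the group count
# roughly squares it (2**10 - 1 = 1023).
print(len(subsets), 2 ** len(groups) - 1)
```
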
jaz46 · almost 10 years ago
I'd have to know a little more about your setup to be sure, but Pachyderm (pachyderm.io) might be a viable option. Full disclosure: I'm one of the founders. The biggest advantage you'd get from our system is that you can continue using all of those Python and bash scripts to analyze your data in a distributed fashion instead of having to learn/use SQL. If it looks like Pachyderm might be a good fit, feel free to email me at joey@pachyderm.io.

cafebeen · almost 10 years ago
Might be worth trying a visual analytics system like Overview: https://blog.overviewdocs.com/

There's also a nice evaluation paper: http://www.cs.ubc.ca/labs/imager/tr/2014/Overview/overview.pdf

Rainymood · almost 10 years ago
Interesting. I recently wrote my thesis on latent Dirichlet allocation (LDA); it's worth checking out. Without going into too much technical detail, LDA is a 'topic model': given a large set of documents (a corpus), it estimates the 'topics' of the corpus and gives a breakdown of each document in terms of how much it contains of topic 1, topic 2, etc.

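A toy gensim sketch of the LDA idea described above; the three "documents" and the topic count are made up, so treat it as shape-of-the-API only:

```python
from gensim import corpora, models

docs = [
    ["romantic", "dinner", "city", "lights"],
    ["hiking", "trail", "mountain", "dog"],
    ["city", "museum", "dinner", "date"],
]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
for doc in bow:
    print(lda.get_document_topics(doc))   # per-document topic mixture
```
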
SQL2219 · almost 10 years ago
SQL Server has a feature called FileTables. Basically it's the ability to dump a bunch of files into a folder; you can then query the contents using semantic search. https://msdn.microsoft.com/en-us/library/ff929144%28v=sql.110%29.aspx

kaa2102 · almost 10 years ago
Looking at some of the comments, it would be helpful if you took a step back and clearly defined the data, the data markers, and the relationship hypotheses you are looking for.

I've done some text file parsing and analysis with just C++ and Excel. You could possibly simplify the analytical process by clearly defining what you need from the text files.

tedchs · almost 10 years ago
Have you considered Google BigQuery? It's a managed data warehouse with a SQL-like query language. It's easy to load in your data, run queries, and then drop the database when you're done with it.

effnorwood · almost 10 years ago
https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

wigsgiw · almost 10 years ago
You might be interested in http://entopix.com; it could be ideal.

timmins · almost 10 years ago
I used a Windows app called TextPipe. It has a large feature set for wrangling text, but the visualization aspect may need another tool.

halayli · almost 10 years ago
It's not clear what your objective is. I could have a 1 KB text file and end up with a 1 TB file after "analyzing" it if I don't have a goal in mind.

gt565k · almost 10 years ago
Apache Solr or Elasticsearch

kitwalker12 · almost 10 years ago
A Java application with SQL adapters and Apache Tika might work.

bra-ket · almost 10 years ago
Apache Spark
Comment #9737919 not loaded

nodivbyzero · almost 10 years ago
grep, sed

codeonfire · almost 10 years ago
If you want high performance and simplicity, why not use flat files, bash, grep (maybe GNU parallel), cut, awk, wc, uniq, etc.? You can get very far with these, and if you have a fast local disk you can get huge read rates; a few GB can be scanned in a matter of seconds. Awk can be used to write your queries. I don't understand what you are trying to do, but if it can be done with a SQL database and doesn't involve lots of joins, then it can be done with a delimited text file. If you don't have a lot of disk space you can also work with gzipped files, zcat, zgrep, etc. I would not even consider distributed solutions or NoSQL until I had at least 100 GB of data (more like 1 TB), and I would not consider any sort of SQL database unless I had a lot of complex joins.
Comment #9737001 not loaded

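The same flat-file approach expressed in Python rather than awk, since Python is already in the poster's stack: stream a gzipped, tab-delimited file line by line and aggregate without holding it in memory. The file name and column layout are hypothetical.

```python
import csv
import gzip
from collections import Counter

counts = Counter()
with gzip.open("posts.tsv.gz", "rt", encoding="utf-8") as f:
    for city, group, body in csv.reader(f, delimiter="\t"):
        counts[(city, group)] += 1   # a cheap "GROUP BY" over a flat file

for key, n in counts.most_common(10):
    print(key, n)
```
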
developer1 · almost 10 years ago
Does the NSA allow its employees to ask the general public for help like this? It seems odd for such a secretive organization to post publicly asking how to parse our conversations.
Comment #9737013 not loaded