> Essentially, when dealing with natural languages hacking a solution is the suggested way of doing things, since nobody can figure out how to do it properly.

That's essentially the TL;DR I took away from the computational linguistics courses I attended.

The Pareto principle is probably at work here. Having no solution is worse than having an 80% solution that works well enough, especially when the 100% solution is much harder to achieve (and some of the problems not even humans can solve properly).
Ha, there's a whole section on clones of the summarizer from Classifier4J.

I wrote that in 2003 (I think?) based on @pg's "A Plan for Spam" essay, and then "invented" the summarization approach (I'm sure others had done similar, but I thought it up myself anyway).

Turns out it was rather well tuned. The 2003 implementation, presumably downloaded from sourceforge(!), still wins comparisons on datasets which didn't even exist when I wrote it [1].

I much prefer the Python implementation though [2], which I hadn't seen before.

Also, Textacy on top of spaCy is awesome for any kind of text work.

[1] https://dl.acm.org/citation.cfm?id=2797081

[2] https://github.com/thavelick/summarize/blob/master/summarize.py
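For anyone curious, the core trick is word statistics applied to sentence selection. A minimal sketch of that style of frequency-based extractive summarizer follows; this is not the Classifier4J or linked Python code itself, and the tokenization is deliberately naive:

```python
import re
from collections import Counter

def summarize(text, num_sentences=3):
    # Naive sentence and word tokenization (good enough for a sketch).
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'[a-z]+', text.lower())
    freq = Counter(words)

    # Score each sentence by the corpus frequency of the words it contains.
    def score(sentence):
        return sum(freq[w] for w in re.findall(r'[a-z]+', sentence.lower()))

    # Keep the top-scoring sentences, returned in their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return ' '.join(s for s in sentences if s in top)
```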
There are a few applications missing:

- Answering a question by returning a search result from a large body of text, e.g. "How do I change the background color of a page in JavaScript?"

- Improving the readability of a text. The article only mentions "understanding how difficult to read is a text".

- Establishing relationships between entities in a body of text, e.g. building a fact graph from sentences like "Burning coal increases CO2" and "CO2 increase induces global warming". Also useful in medical literature, where there are millions of pathways. (A rough sketch follows below.)

- Answering a question using a large body of facts. Like search, but returning a precise answer.

- Finding and correcting spelling/grammatical errors.
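As a rough illustration of the fact-graph idea, here is a hedged sketch that pulls crude (subject, verb, object) triples out of text using spaCy's dependency parse. The exact triples you get depend on the model and its dependency labels, so treat the example output as approximate:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    """Extract rough (subject, verb, object) triples from a text."""
    doc = nlp(text)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples

print(extract_triples("Burning coal increases CO2. CO2 increase induces global warming."))
# Roughly something like [('coal', 'increase', 'CO2'), ('increase', 'induce', 'warming')],
# depending on how the parser analyses each sentence.
```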
A lot to review, read, and learn. Thanks a lot for sharing this. Any plans to extend it, or to write another one covering even more, like natural language generation (not limited to bots; we are using it for weather forecasts) and coreference resolution?
I'm always astonished at how little mention gensim gets, considering it can be used for basically all the listed tasks, including parsing if you combine it with your favorite deep learning library (DyNet, anyone?).
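For example, a tiny topic-modelling sketch with gensim, covering one of the listed tasks out of the box (the toy corpus here is made up):

```python
from gensim import corpora, models

# Toy corpus: each document is a list of tokens (pretend it was tokenized already).
texts = [
    ["human", "machine", "interface", "computer"],
    ["survey", "user", "computer", "system", "response"],
    ["graph", "trees", "minors", "graph"],
]

dictionary = corpora.Dictionary(texts)            # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words vectors

# Train a tiny LDA topic model and print the discovered topics.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```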
Was hoping for some discussion about word vectors like word2vec. I keep reading about them, but don't really understand what they're useful for.
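Roughly, word vectors place words with similar meanings close together in a vector space, so you can ask for nearest neighbours and even do analogy arithmetic. A small sketch using gensim's downloader and pre-trained GloVe vectors (the dataset name is as I recall it, so treat it as an assumption):

```python
import gensim.downloader as api

# Downloads a small set of pre-trained GloVe vectors on first use.
vectors = api.load("glove-wiki-gigaword-50")

# Words with related meanings end up close together in vector space...
print(vectors.most_similar("coffee", topn=3))

# ...and directions in that space capture relations, the classic example being
# king - man + woman ~= queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```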
My experience with your site on mobile: https://m.imgur.com/5vLrEJH

Can't get it to go away, can't read the article.
Is there an equivalent to MNIST for NLP? I've always wanted to play around in this space, but I don't know a good, simple dataset to start with.
Your 'send me a PDF' popup has the background fade div above the form so it's impossible to fill in the form (without opening dev tools).
Quite an obnoxious website on my phone. Anyway, I came here to point to GATE as a mature FLOSS option: https://gate.ac.uk/
Recommend Dan Jurafsky and Chris Manning's online course at Stanford:

https://www.youtube.com/watch?v=nfoudtpBV68