Word2Vec and bag-of-words/tf-idf are somewhat obsolete in 2018 for modeling. For classification tasks, fastText (<a href="https://github.com/facebookresearch/fastText" rel="nofollow">https://github.com/facebookresearch/fastText</a>) performs better and trains faster.<p>fastText is also available in the popular NLP Python library gensim, with good documentation: <a href="https://radimrehurek.com/gensim/models/fasttext.html" rel="nofollow">https://radimrehurek.com/gensim/models/fasttext.html</a><p>And of course, if you have a GPU, recurrent neural networks (or other deep learning architectures) are the endgame for the remaining 10% of problems (a good example is spaCy's DL implementation: <a href="https://spacy.io/" rel="nofollow">https://spacy.io/</a>). Or use those libraries with fastText for text encoding, which has worked well in my use cases.
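Part of what makes fastText robust on classification is that it represents each word as a bag of character n-grams, so rare and misspelled words still get usable vectors. A rough, dependency-free sketch of just the n-gram extraction step (the function name and defaults here are my own, not fastText's API):

```python
def char_ngrams(word, n_min=3, n_max=6):
    # fastText-style subwords: wrap the word in boundary markers,
    # then collect every character n-gram of length n_min..n_max.
    padded = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

# For n = 3, "where" yields: <wh, whe, her, ere, re>
print(char_ngrams("where", 3, 3))
```

The word's embedding is then built from the vectors of these subwords, which is why two words sharing morphology end up close together even if one never appeared in training.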
I am not sure how many people have an issue with this, but it seems to me that computer science, just over the relatively short time I've been paying attention, is becoming more and more abstract in a lot of ways.<p>You can code something incredibly complex that works great without understanding any of the math underneath. Understanding the math arguably makes you a better engineer overall, but it isn't required to solve many of these problems.<p>I think it's pretty cool, but I'm sure a lot of people have a big issue with the "just TRUST the library!" approach.
NLP is one of the most challenging areas of research, and nothing in this article will help solve even 0.009% of those challenges<p>Example of the wisdom herein:<p>> Remove words that are not relevant, such as “@” twitter mentions or urls
The thing that jumped out at me in this article was the use of LIME to explain models - I hadn't heard of it before.<p><a href="https://github.com/marcotcr/lime" rel="nofollow">https://github.com/marcotcr/lime</a><p>For NLP tasks, it looks like what it does is selectively delete words from the input and check the classifier output. This way it determines which words have the biggest effect on the output without needing to know anything about how your model works.
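LIME proper fits a local linear model over many random perturbations, but the core "delete words and watch the score move" idea can be sketched much more crudely. Everything below (the function, the toy classifier) is made up for illustration, not LIME's actual API:

```python
def explain_by_deletion(text, classify):
    # classify: any black-box callable returning a score for the text.
    # Drop each word in turn and measure how much the score falls;
    # bigger drops mean the word mattered more to the prediction.
    words = text.split()
    base = classify(text)
    importance = {}
    for i, w in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        importance[w] = base - classify(perturbed)
    return sorted(importance.items(), key=lambda kv: -kv[1])

# Toy black-box classifier: fraction of "alarm" words in the text.
ALARM = {"fire", "explosion"}
def toy_classifier(text):
    toks = text.split()
    return sum(t in ALARM for t in toks) / max(len(toks), 1)

# "fire" should rank first, since deleting it hurts the score most.
print(explain_by_deletion("huge fire near the bridge", toy_classifier))
```

The nice part, as the parent comment notes, is that `classify` is a pure black box - nothing about the model's internals is needed.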
A question to the NLP experts out there -- is it possible to automatically detect various pre-defined attributes about a person by automatically analyzing relevant texts? For example, finding out whether a person is anti-capitalist by scanning his blog posts related to economics. I'm not even sure how to approach such a problem.
This was an interesting read, but when I read sentences like "The words it picked up look much more relevant!", I'm reminded of the XKCD explanation of machine learning: <a href="https://xkcd.com/1838/" rel="nofollow">https://xkcd.com/1838/</a>
I know natural language processing predates Neuro-linguistic programming, but I still can’t see ‘NLP’ without the little hairs on the back of my neck standing up.
Bad title (this is all about text classification/mining, not NLP), but a very nice introduction nonetheless. Maybe a tad optimistic - I'd never even consider applying a classifier with 80% accuracy to the Twitter firehose (unless extremely noisy performance were a non-issue - but it never is ... :-)).
I think<p><pre><code>def sanitize_characters(raw, clean):
    for line in input_file:
        out = line
        output_file.write(line)

sanitize_characters(input_file, output_file)
</code></pre>
should be<p><pre><code>def sanitize_characters(raw, clean):
    for line in raw:
        out = line
        clean.write(line)

sanitize_characters(input_file, output_file)
</code></pre>
in your notebook: <a href="https://github.com/hundredblocks/concrete_NLP_tutorial/blob/master/NLP_notebook.ipynb" rel="nofollow">https://github.com/hundredblocks/concrete_NLP_tutorial/blob/...</a><p>Or am I mistaken?