How to solve most NLP problems: a step-by-step guide

461 points by e_ameisen over 7 years ago

13 comments

minimaxir over 7 years ago

Word2Vec and bag-of-words/tf-idf are somewhat obsolete in 2018 for modeling. For classification tasks, fasttext (https://github.com/facebookresearch/fastText) performs better and faster.

Fasttext is also available in the popular NLP Python library gensim, with a good demo notebook: https://radimrehurek.com/gensim/models/fasttext.html

And of course, if you have a GPU, recurrent neural networks (or other deep learning architectures) are the endgame for the remaining 10% of problems (a good example is SpaCy's DL implementation: https://spacy.io/). Or use those libraries to incorporate fasttext for text encoding, which has worked well in my use cases.

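(For readers who want to try the gensim FastText API mentioned above, a minimal sketch follows; the toy corpus and hyperparameter values are illustrative assumptions, and the gensim 4.x keyword names are assumed.)

    # Minimal sketch: training FastText word vectors with gensim.
    # The corpus and hyperparameters are toy values for illustration.
    from gensim.models import FastText

    sentences = [
        ["natural", "language", "processing", "is", "fun"],
        ["fasttext", "models", "subword", "information"],
    ]

    # gensim 4.x uses vector_size; older releases called it size.
    model = FastText(sentences, vector_size=100, window=5, min_count=1)

    # Because FastText builds vectors from character n-grams, it can
    # embed words it never saw during training.
    print(model.wv["languages"][:5])
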
odonnellryan over 7 years ago

I am not sure how many people have an issue with this, but it seems to me that computer science, just over the relatively short time I've been paying attention, is becoming more and more abstract in a lot of ways.

You can code something incredibly complex that works great without understanding any of the math underneath. Understanding the math arguably makes you a better engineer overall, but isn't required to solve many of these problems.

I think it's pretty cool, but I'm sure a lot of people have a big issue with the "just TRUST the library!" approach.

paulsutter over 7 years ago

NLP is one of the most challenging areas of research, and nothing in this article will help solve even 0.009% of those challenges.

Example of the wisdom herein:

> Remove words that are not relevant, such as “@” twitter mentions or urls

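(The quoted preprocessing step boils down to a couple of regex substitutions. A minimal sketch, with simple patterns that are assumptions and will not cover every edge case:)

    # Strip Twitter @-mentions and URLs before tokenization, as the
    # quoted article step suggests. The regexes are simplifications.
    import re

    def remove_mentions_and_urls(text):
        text = re.sub(r"https?://\S+", " ", text)  # drop URLs
        text = re.sub(r"@\w+", " ", text)          # drop @mentions
        return re.sub(r"\s+", " ", text).strip()   # tidy whitespace

    print(remove_mentions_and_urls("@user fire downtown https://t.co/x"))
    # -> "fire downtown"
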
Rickasaurus over 7 years ago
Bag of words is the death of comprehensible NLP

polm23 over 7 years ago

The thing that jumped out at me in this article was the use of Lime to explain models - I hadn't heard of it before.

https://github.com/marcotcr/lime

For NLP tasks, it looks like what it does is selectively delete words from the input and check the classifier output. This way it determines which words have the biggest effect on the output without needing to know anything about how your model works.

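(A minimal sketch of that workflow with the lime package and a toy scikit-learn pipeline; the training examples and class names are made-up assumptions:)

    # Explain a text classifier with LIME by perturbing the input.
    # The four-example training set below is a toy assumption.
    from lime.lime_text import LimeTextExplainer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["fire in the building", "lovely sunny day",
             "earthquake hits the city", "great movie night"]
    labels = [1, 0, 1, 0]  # 1 = disaster, 0 = irrelevant

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)

    # LIME deletes words from the input and watches predict_proba,
    # so it needs no access to the model internals.
    explainer = LimeTextExplainer(class_names=["irrelevant", "disaster"])
    exp = explainer.explain_instance("fire near the city",
                                     model.predict_proba, num_features=4)
    print(exp.as_list())  # words ranked by their effect on the output
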
paultopia over 7 years ago

I think this might be the first blog post I've read to actually explain how to use word vectors as features - good for the author!

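(A common recipe for this is to average a document's word vectors into one fixed-length feature vector. A minimal sketch, assuming word_vectors is a pretrained gensim KeyedVectors object:)

    # Turn a tokenized document into a single feature vector by
    # averaging its word embeddings; out-of-vocabulary tokens are skipped.
    import numpy as np

    def document_vector(tokens, word_vectors):
        vecs = [word_vectors[t] for t in tokens if t in word_vectors]
        if not vecs:  # no in-vocabulary tokens: fall back to zeros
            return np.zeros(word_vectors.vector_size)
        return np.mean(vecs, axis=0)

    # The stacked vectors can feed any standard classifier:
    # X = np.vstack([document_vector(d.split(), word_vectors) for d in docs])
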
mikevm over 7 years ago

A question to the NLP experts out there -- is it possible to automatically detect various pre-defined attributes about a person by analyzing relevant texts? For example, finding out whether a person is anti-capitalist by scanning his blog posts related to economics. I'm not even sure how to approach such a problem.

master_yoda_1 over 7 years ago

The title of the blog is too ambitious.

CGamesPlay over 7 years ago

This was an interesting read, but when I read sentences like "The words it picked up look much more relevant!", I'm reminded of the XKCD explanation of machine learning: https://xkcd.com/1838/

hinkley over 7 years ago
I know natural language processing predates Neuro-linguistic programming, but I still can’t see ‘NLP’ without the little hairs on the back of my neck standing up.

fnl over 7 years ago

Bad title (this is all about text classification/mining, not NLP), but a very nice introduction at that. Maybe a tad optimistic - I'd never even consider applying a classifier with 80% accuracy to the Twitter firehose (unless extremely noisy performance were a non-issue - but it never is... :-)).

code4tee over 7 years ago
Good intro, but the approaches used here are quite basic and outdated for 2018. Not sure this solves 90% of NLP problems.

phijFTW over 7 years ago

I think

    def sanitize_characters(raw, clean):
        for line in input_file:
            out = line
            output_file.write(line)

    sanitize_characters(input_file, output_file)

should be

    def sanitize_characters(raw, clean):
        for line in raw:
            out = line
            clean.write(line)

    sanitize_characters(input_file, output_file)

in your notebook: https://github.com/hundredblocks/concrete_NLP_tutorial/blob/master/NLP_notebook.ipynb

Or am I mistaken?