Naive Bayes and Text Classification I – Introduction and Theory

115 points by rasbt, over 10 years ago

7 comments

syllogism, over 10 years ago
> Naive Bayes classifiers, a family of classifiers that are based on the popular Bayes' probability theorem, are known for creating simple yet well performing models, especially in the fields of document classification and disease prediction.

But this is simply not true! They _don't_ perform well. There's really no reason to teach people Naive Bayes any more, except as a footnote when explaining log-linear/MaxEnt models.

MaxEnt is not so complicated, and it makes Naive Bayes fully obsolete. And if MaxEnt is in some way too complicated/expensive, Averaged Perceptron is generally much better than NB, can be implemented in 50 lines of Python, and has far fewer hyper-parameters.

A common way for machine learning courses to suck is to teach students about a bunch of crap, obsolete algorithms they should never use, simply for historical reasons --- they used to be in the course, so they stay in the course.
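The averaged perceptron mentioned above does fit comfortably in well under 50 lines of Python. Here is a minimal sketch of the binary case with the standard lazy-averaging trick; the dict-of-counts feature representation and any data fed to it are illustrative, not from the thread:

```python
from collections import defaultdict

def train_averaged_perceptron(examples, epochs=5):
    """Train a binary averaged perceptron.

    `examples` is a list of (feature_counts_dict, label) pairs with
    label in {-1, +1}. Returns the averaged weight vector as a dict.
    """
    weights = defaultdict(float)    # current weights
    totals = defaultdict(float)     # running weight sums for averaging
    timestamps = defaultdict(int)   # step at which each weight last changed
    step = 0
    for _ in range(epochs):
        for feats, label in examples:
            score = sum(weights[f] * v for f, v in feats.items())
            pred = 1 if score >= 0 else -1
            if pred != label:
                for f, v in feats.items():
                    # accumulate the old weight for every step it was live
                    totals[f] += (step - timestamps[f]) * weights[f]
                    timestamps[f] = step
                    weights[f] += label * v
            step += 1
    # flush remaining contributions, then average over all steps
    for f in weights:
        totals[f] += (step - timestamps[f]) * weights[f]
    return {f: totals[f] / step for f in weights}

def predict(avg_weights, feats):
    score = sum(avg_weights.get(f, 0.0) * v for f, v in feats.items())
    return 1 if score >= 0 else -1
```

The lazy averaging (only touching a weight's running total when it actually changes) is what keeps this fast on sparse text features, since most weights are untouched on any given update.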
rasbt, over 10 years ago
I posted this article a while ago, and I thought it might not be a bad idea to archive it on arXiv. What do you think? The (LaTeX) PDF version would look like this: http://sebastianraschka.com/PDFs/articles/naive_bayes_1.pdf

I just signed up on arXiv and see that the catch is that I'd need 2 recommendations for the categories Computer Science -> Learning or Statistics -> Machine Learning. Would anyone here be willing to give me a recommendation? I would really appreciate it!
dj-wonk, over 10 years ago
Re: "Empirical studies showed that the mutli-variate Bernoulli model is inferior to the multinomial Bayes model, and the latter has been shown to reduce the error by 27% on average [13]."

I remain skeptical of statements claiming that one kind of model is categorically inferior. I also have a hard time believing this is a fair summary of the literature. First, only one study is cited, but the sentence implies many.

Second, the citation in the original post is not a fair summary of the cited research. The abstract of "A Comparison of Event Models for Naive Bayes Text Classification" says: "This paper aims to clarify the confusion by describing the differences and details of these two models, and by empirically comparing their classification performance on five text corpora. We find that the multi-variate Bernoulli performs well with small vocabulary sizes, but that the multinomial usually performs even better at larger vocabulary sizes—providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size."
MojoJolo, over 10 years ago
I actually really like doing text classification using Naive Bayes. I'm still new to it and still learning a lot. But one thing I'm having a hard time with is explaining Naive Bayes classification in simple terms.

If you were asked, how would you explain Naive Bayes classification in simple terms?
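One common plain-terms framing: for each class, multiply a prior (how common the class is) by, for each word in the document, how typical that word is of the class, then pick the class with the largest product. The "naive" part is treating the words as independent given the class. A minimal multinomial sketch with Laplace smoothing; the spam/ham toy data in the test is made up for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, label) pairs. Multinomial NB."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)  # per-label word frequencies
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab, len(docs)

def classify(model, words):
    label_counts, word_counts, vocab, n_docs = model
    best_label, best_logp = None, float("-inf")
    for label, n in label_counts.items():
        logp = math.log(n / n_docs)  # log prior
        total = sum(word_counts[label].values())
        for w in words:
            if w in vocab:
                # Laplace-smoothed P(word | label)
                logp += math.log((word_counts[label][w] + 1)
                                 / (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

Working in log space avoids underflow when multiplying many small per-word probabilities.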
meaty_sausages, over 10 years ago
Lexicon-based approaches are better than all these approaches. The difference between MaxEnt and NB is similar to (the same as?) the difference between binomial and multinomial regression. They have all of the drawbacks and only a few of the benefits. A decent word list would be about as accurate as a regression (70% or so). Account for syntax and it will get you higher.
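For concreteness, a lexicon-based classifier in the sense described can be as simple as summing per-word scores from a fixed word list. The six-word lexicon below is a made-up stand-in for a real list with thousands of entries:

```python
# Hypothetical miniature sentiment lexicon; real ones run to thousands
# of scored entries.
LEXICON = {"good": 1, "great": 1, "excellent": 1,
           "bad": -1, "awful": -1, "poor": -1}

def lexicon_classify(words, lexicon=LEXICON):
    """Sum per-word lexicon scores; positive total -> 'pos', else 'neg'."""
    score = sum(lexicon.get(w, 0) for w in words)
    return "pos" if score > 0 else "neg"
```

No training step at all, which is the appeal; the accuracy then depends entirely on the quality and coverage of the word list.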
dj-wonk, over 10 years ago
There are typos:

2.6.3. The correct term is "Additive Smoothing" (https://en.wikipedia.org/wiki/Additive_smoothing), not "smoothening".

3.3. Letters transposed. The author meant "Multi-variate".
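On 2.6.3: additive (Lidstone) smoothing replaces the maximum-likelihood estimate count/total with (count + alpha) / (total + alpha * |V|), so unseen words get a small nonzero probability; alpha = 1 is the Laplace case. A one-function sketch:

```python
def additive_smoothing(count, total, vocab_size, alpha=1.0):
    """Additive (Lidstone) smoothed estimate of P(word | class).

    count: occurrences of the word in the class
    total: total word occurrences in the class
    vocab_size: |V|, number of distinct words
    alpha = 1.0 gives Laplace smoothing.
    """
    return (count + alpha) / (total + alpha * vocab_size)
```

Because the same alpha is added to every word, the smoothed estimates still sum to 1 over the vocabulary.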
en4bz, over 10 years ago
I'm currently working on a school project to classify abstracts from papers into 4 categories. Currently Bernoulli Naive Bayes scores 85% accuracy. For comparison, kNN scores about the same.
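For anyone comparing the event models discussed upthread: the Bernoulli variant scores both the presence and the absence of every vocabulary word in a document, unlike the multinomial variant, which only scores occurrences. A minimal sketch; the toy "abstracts" in the test are invented, not the poster's data:

```python
import math
from collections import Counter, defaultdict

def train_bernoulli_nb(docs):
    """docs: list of (set_of_words, label) pairs.

    Models per-document word presence/absence (Bernoulli event model).
    """
    label_docs = Counter(label for _, label in docs)
    doc_freq = defaultdict(Counter)  # per-label: docs containing each word
    vocab = set()
    for words, label in docs:
        for w in set(words):
            doc_freq[label][w] += 1
        vocab.update(words)
    return label_docs, doc_freq, vocab, len(docs)

def classify_bernoulli(model, words):
    label_docs, doc_freq, vocab, n_docs = model
    present = set(words) & vocab
    best, best_logp = None, float("-inf")
    for label, n in label_docs.items():
        logp = math.log(n / n_docs)  # log prior
        for w in vocab:
            # Laplace-smoothed P(word present | label)
            p = (doc_freq[label][w] + 1) / (n + 2)
            logp += math.log(p) if w in present else math.log(1 - p)
        if logp > best_logp:
            best, best_logp = label, logp
    return best
```

The inner loop over the whole vocabulary (including absent words) is exactly why this model behaves differently at large vocabulary sizes, per the McCallum and Nigam comparison quoted earlier in the thread.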