
Naive Bayes and Text Classification I – Introduction and Theory

115 points by rasbt, over 10 years ago

7 comments

syllogism (over 10 years ago)
> Naive Bayes classifiers, a family of classifiers that are based on the popular Bayes' probability theorem, are known for creating simple yet well performing models, especially in the fields of document classification and disease prediction.

But this is simply not true! They _don't_ perform well. There's really no reason to teach people Naive Bayes any more, except as a footnote when explaining log-linear/MaxEnt models.

MaxEnt is not so complicated, and it makes Naive Bayes fully obsolete. And if MaxEnt is in some way too complicated/expensive, Averaged Perceptron is generally much better than NB, can be implemented in 50 lines of Python, and has far fewer hyper-parameters.

A common way for machine learning courses to suck is to teach students about a bunch of crap, obsolete algorithms they should never use, simply for historical reasons: they used to be in the course, so they stay in the course.
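For reference, here is roughly what the "50 lines of Python" averaged perceptron mentioned above could look like. This is a sketch, not the commenter's code; the sparse bag-of-words feature format and the toy training data are invented for illustration.

```python
from collections import defaultdict

class AveragedPerceptron:
    """Multiclass averaged perceptron over sparse bag-of-words features."""

    def __init__(self, classes):
        self.classes = list(classes)
        self.weights = {}                  # feature -> {class: weight}
        self._totals = defaultdict(float)  # (feature, class) -> accumulated weight
        self._tstamps = defaultdict(int)   # (feature, class) -> step of last update
        self.i = 0                         # number of updates seen

    def predict(self, features):
        scores = defaultdict(float)
        for feat, value in features.items():
            for label, weight in self.weights.get(feat, {}).items():
                scores[label] += value * weight
        # ties go to the first class listed
        return max(self.classes, key=lambda label: scores[label])

    def update(self, truth, guess, features):
        def upd(label, feat, delta):
            key = (feat, label)
            weight = self.weights.setdefault(feat, {}).get(label, 0.0)
            # lazily accumulate the weight over the steps it stayed constant
            self._totals[key] += (self.i - self._tstamps[key]) * weight
            self._tstamps[key] = self.i
            self.weights[feat][label] = weight + delta

        self.i += 1
        if truth == guess:
            return
        for feat in features:
            upd(truth, feat, +1.0)
            upd(guess, feat, -1.0)

    def average_weights(self):
        # replace each weight by its average over all update steps
        if self.i == 0:
            return
        for feat, weights in self.weights.items():
            for label, weight in list(weights.items()):
                key = (feat, label)
                total = self._totals[key] + (self.i - self._tstamps[key]) * weight
                weights[label] = total / self.i

# toy usage with binary bag-of-words features
model = AveragedPerceptron(classes=["ham", "spam"])
train = [({"free": 1, "money": 1}, "spam"),
         ({"meeting": 1, "tomorrow": 1}, "ham")]
for _ in range(5):
    for feats, label in train:
        model.update(label, model.predict(feats), feats)
model.average_weights()
print(model.predict({"free": 1, "money": 1}))  # -> "spam"
```

Averaging the weights over all update steps is what makes this variant competitive: the final model is the mean of every intermediate model, which damps the oscillation of plain perceptron updates.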
rasbt (over 10 years ago)
I posted this article a while ago, and I thought it may not be a bad idea to archive it on arXiv. What do you think? The (LaTeX) PDF version would look like this: http://sebastianraschka.com/PDFs/articles/naive_bayes_1.pdf

I just signed up on arXiv and see that the catch is that I'd need 2 recommendations for the categories Computer Science -> Learning or Statistics -> Machine Learning. Would anyone here want to give me a recommendation? I would really appreciate it!
dj-wonk (over 10 years ago)
Re: "Empirical studies showed that the multi-variate Bernoulli model is inferior to the multinomial Bayes model, and the latter has been shown to reduce the error by 27% on average [13]."

I remain skeptical of statements claiming that one kind of model is categorically inferior. I also have a hard time believing this is a fair summary of the literature. First, only one study is cited, but the sentence implies many.

Second, the citation in the original post is not a fair summary of the cited research. The abstract of "A Comparison of Event Models for Naive Bayes Text Classification" says: "This paper aims to clarify the confusion by describing the differences and details of these two models, and by empirically comparing their classification performance on five text corpora. We find that the multi-variate Bernoulli performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes—providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size."
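For readers who want to test this kind of claim on their own data, both event models ship with scikit-learn. A minimal comparison sketch follows; the six documents and labels are invented stand-ins, not the paper's five corpora.

```python
# Compare the two Naive Bayes event models discussed above on a toy corpus.
# BernoulliNB models word presence/absence; MultinomialNB models word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "cheap meds buy now", "limited offer buy cheap", "buy now free offer",
    "project meeting at noon", "notes from the meeting", "noon deadline for project",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

for name, clf in [("Bernoulli NB", BernoulliNB()), ("Multinomial NB", MultinomialNB())]:
    # BernoulliNB binarizes the counts itself (binarize=0.0 by default)
    pipe = make_pipeline(CountVectorizer(), clf)
    scores = cross_val_score(pipe, docs, labels, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```

On a real task, vocabulary size is the variable the cited paper says matters, so varying CountVectorizer's `max_features` is the natural way to reproduce its comparison.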
MojoJolo (over 10 years ago)
I actually really like doing text classification using Naive Bayes. I'm still new to it and still learning a lot. But one thing I'm having a hard time with is explaining Naive Bayes classification in simple terms.

If you were asked, how would you explain Naive Bayes classification in simple terms?
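One possible answer to the question above: Naive Bayes is counting plus one independence assumption. You count how often each word appears in each class, and a new document goes to the class whose prior times per-word probabilities is largest. A worked toy example (all counts invented):

```python
# Naive Bayes as plain counting. The "naive" part is assuming words are
# independent of each other given the class, so we can just multiply.
from collections import Counter

# word counts per class from a hypothetical training set (invented numbers)
spam_words = Counter({"buy": 3, "cheap": 2, "meeting": 1})
ham_words = Counter({"buy": 1, "cheap": 1, "meeting": 4})
n_spam_docs, n_ham_docs = 4, 6  # 10 training documents in total

def score(words, word_counts, n_docs, total_docs, vocab_size, alpha=1.0):
    # prior: how common the class is overall
    p = n_docs / total_docs
    total = sum(word_counts.values())
    for w in words:
        # per-word likelihood, with additive (Laplace) smoothing so an
        # unseen word doesn't zero out the whole product
        p *= (word_counts[w] + alpha) / (total + alpha * vocab_size)
    return p

vocab = set(spam_words) | set(ham_words)
msg = ["buy", "cheap"]
p_spam = score(msg, spam_words, n_spam_docs, 10, len(vocab))
p_ham = score(msg, ham_words, n_ham_docs, 10, len(vocab))
print("spam" if p_spam > p_ham else "ham")  # -> spam
```

In one sentence: "Which class was more likely to have produced these words, given how often it produced them during training?"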
meaty_sausages (over 10 years ago)
Lexicon-based approaches are better than all of these. The difference between MaxEnt and NB is similar to (the same as?) the difference between binomial and multinomial regression; they have all of the drawbacks and only a few of the benefits. A decent word list would be about as accurate as a regression (70% or so). Account for syntax and it will get you higher.
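For contrast with the learned models discussed elsewhere in the thread, a lexicon-based classifier in the sense described above can be as simple as two word lists and a vote. The sketch below uses tiny invented sentiment word lists purely for illustration; a serious lexicon would be far larger and, as the comment notes, would account for syntax.

```python
# A lexicon-based classifier: no training, just hand-picked word lists
# and a vote over the words in the text.
POSITIVE = {"great", "excellent", "love", "good"}
NEGATIVE = {"terrible", "awful", "hate", "bad"}

def classify(text):
    words = text.lower().split()
    hits = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if hits > 0 else "negative" if hits < 0 else "neutral"

print(classify("I love this great phone"))         # -> positive
print(classify("awful battery and a bad screen"))  # -> negative
```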
dj-wonk (over 10 years ago)
There are typos:

2.6.3. The correct term is "Additive Smoothing" (https://en.wikipedia.org/wiki/Additive_smoothing), not "smoothening".

3.3. Letters transposed. The author meant "Multi-variate".
en4bz (over 10 years ago)
I'm currently working on a school project to classify abstracts from papers into 4 categories. Currently, Bernoulli Naive Bayes scores 85% accuracy. For comparison, kNN scores about the same.