Oh my yes! Machine learning is some of the hardest programming there is. You only ever get an indirect measure of whether it is working correctly or not. It's hard to debug. My algorithm gets it right 80% of the time; have I made an implementation mistake? Who knows?<p>My general strategy is to invest in training set curation and evaluation. I also use quick scatter plots to check that <i>I</i> can separate the training set into classes easily. If it's not easy to do by eye, then the machine is not magic and probably can't do it either. If I can't, it's time to rethink the representation.<p>The author correctly underlines the importance of the training set, but it's equally critical to have the right representation (the features). If you project your data into the right space, then pretty much any ML algorithm will be able to learn on it. i.e. it's more about what data you put in than about the processor. k-means and decision trees FTW<p>EDIT:
Oh, and maybe very relevant: the importance of data normalization. Naive Bayes classifiers assume features are conditionally independent, so you have to PCA or ICA your data first (or both), otherwise correlated features get "counted" twice, e.g. every wedding-related word counting toward the spam categorization on its own. PCA notices which variables are highly correlated and projects them onto a common measure of "weddingness". Very easy with scikit-learn: run its PCA with whiten=True before the classifier.
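A minimal sketch of the "eyeball it first, then decorrelate" workflow described above, using scikit-learn. The synthetic data, the 2D scatter check, and the choice of 10 components are placeholders; only the library calls are real.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Stand-in for a real feature matrix (e.g. per-message word counts).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Eyeball check: project to 2D and see whether *you* can separate the classes.
proj = PCA(n_components=2).fit_transform(X_train)
plt.scatter(proj[:, 0], proj[:, 1], c=y_train, s=10)
plt.title("If you can't see two clusters here, rethink the features")
plt.show()

# whiten=True decorrelates and rescales the components, so a cluster of
# correlated "weddingness" words stops being counted twice by Naive Bayes.
model = make_pipeline(PCA(n_components=10, whiten=True), GaussianNB())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```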
I've wanted to build this for a while; I think an SVM-based spam solution could be amazing. Obviously, as the article mentions, a purely Bayesian approach is not great at categorizing spam, and neither is an ANN (although with a large enough pool of hidden layers it can get pretty decent). I think the issue lies in the problem itself: spam cannot be treated as a linearly separable problem.<p>There are papers[1][2] that outline possible benefits of SVM-based spam filtering. Unfortunately, SVMs are still in their infancy and not many people know how to implement and use them. I do think they are the future, however.<p>[1] <a href="http://trec.nist.gov/pubs/trec15/papers/hit.spam.final.final.pdf" rel="nofollow">http://trec.nist.gov/pubs/trec15/papers/hit.spam.final.final...</a><p>[2] <a href="http://classes.soe.ucsc.edu/cmps290c/Spring12/lect/14/00788645-SVMspam.pdf" rel="nofollow">http://classes.soe.ucsc.edu/cmps290c/Spring12/lect/14/007886...</a>
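For what it's worth, a rough sketch of the SVM approach with scikit-learn: a linear SVM over tf-idf features. The tiny inline corpus is obviously fake; a real experiment would train on something like the TREC spam corpus from the papers above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

messages = [
    "Cheap meds, click here now!!!",        # spam
    "Win a free iPhone, limited offer",     # spam
    "Meeting moved to 3pm tomorrow",        # ham
    "Here are the slides from yesterday",   # ham
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Word unigrams + bigrams, fed into a linear-kernel SVM.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(C=1.0),
)
clf.fit(messages, labels)
print(clf.predict(["free meds offer, click now"]))  # should print [1], i.e. spam
```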
The problem now is that (imho) 99% of the links posted on the internet are spam.<p>Unless you have a baseline of "what was here first" and "exactly when every website went live with what links" like Google does (they have been indexing websites since the dawn of time as far as the internet and linking are concerned; heck, there wasn't even backlink spamming before Google, because Google was the first search engine to rank by number of backlinks), you're going to have a really tough time determining what is spam and what isn't.<p>Which 1% do you decide to focus on?
I've built an anti-spam system for Delicious.com using a Naive Bayes classifier with a really huge feature database (think tens of millions of features), mostly tokens from different parts of the page. Those features are given different weights, which contribute to the final probability aggregation. The result was similar to what the OP achieved: around 80% accuracy. The piece of work was really interesting and satisfying.
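A toy sketch of that idea: tokens from different parts of a page are separate features, and each part gets its own weight in the final log-probability sum. The fields, weights, and token counts below are invented; the real Delicious system was of course far bigger.

```python
import math
from collections import defaultdict

FIELD_WEIGHTS = {"title": 2.0, "url": 1.5, "body": 1.0}  # assumed per-field weights

# token -> [count in spam pages, count in ham pages], kept per field, Laplace-smoothed
counts = {f: defaultdict(lambda: [1, 1]) for f in FIELD_WEIGHTS}
totals = {f: [2, 2] for f in FIELD_WEIGHTS}

def train(page_fields, is_spam):
    for field, tokens in page_fields.items():
        for tok in tokens:
            counts[field][tok][0 if is_spam else 1] += 1
            totals[field][0 if is_spam else 1] += 1

def spam_score(page_fields):
    # Weighted sum of per-token log-likelihood ratios across all fields.
    score = 0.0
    for field, tokens in page_fields.items():
        w = FIELD_WEIGHTS[field]
        for tok in tokens:
            spam_c, ham_c = counts[field][tok]
            score += w * (math.log(spam_c / totals[field][0]) -
                          math.log(ham_c / totals[field][1]))
    return score  # > 0 leans spam, < 0 leans ham

train({"title": ["cheap", "pills"], "url": ["pills"], "body": ["buy", "now"]}, True)
train({"title": ["python", "tips"], "url": ["blog"], "body": ["code", "tests"]}, False)
print(spam_score({"title": ["cheap", "pills"], "url": ["shop"], "body": ["now"]}))
```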
I'm surprised this post doesn't mention Markov chains. The author seems to think that finding and implementing a grammar quality checker will help stop spam. Aside from providing endless hours of entertainment via DissociatedPress, Markov chains are abused by spammers to generate grammatically plausible nonsense. You can easily add meaning to the "nonsense" by adding formatting to certain words to carry a secondary message. Does anyone know of a way to stop this?
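For anyone who hasn't played with it, this is all a word-level Markov chain does; trained on real text, it spits out locally plausible nonsense that a grammar or word-frequency check struggles with. The corpus and the order-1 chain here are just for illustration.

```python
import random
from collections import defaultdict

corpus = ("the wedding was lovely and the cake was lovely and "
          "the guests enjoyed the cake at the wedding").split()

# Map each word to the list of words observed to follow it.
chain = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    chain[prev].append(nxt)

def generate(start, length=12):
    word, out = start, [start]
    for _ in range(length):
        if word not in chain:
            break
        word = random.choice(chain[word])
        out.append(word)
    return " ".join(out)

print(generate("the"))  # e.g. "the cake was lovely and the guests enjoyed ..."
```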
This is the really hard way.<p>And it is going to fail A LOT.<p>Do this instead:<p>1. Contact a company that has a search engine and therefore access to all your links. ( <a href="http://samuru.com" rel="nofollow">http://samuru.com</a> ) springs to mind.<p>2. Do keyword extraction on those pages. Assume that any linking page that doesn't contain any of the keywords of the page being linked to is a Bad Link.<p>3. For the ones that remain, Google the keywords you extracted (say, 10 of the words). If the linking page doesn't appear in the top 50 results, it is probably a Bad Neighbor according to Google.<p>This method doesn't require NLTK or grammar checking. You can do it yourself, and you are using Google to tell you whether the site is on the Bad Neighbor list, so you don't have to guess.
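A hedged sketch of step 2: pull the top keywords from the linked-to page and flag a linking page as a Bad Link if it shares none of them. The stop-word list and the top-10 cutoff are arbitrary choices, and step 3 (Googling those keywords) is left out because it needs a search API.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}

def keywords(text, n=10):
    # Crude term-frequency keyword extraction: most common non-stopword terms.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    return {w for w, _ in Counter(words).most_common(n)}

def is_bad_link(target_page_text, linking_page_text):
    # "Bad" if the linking page contains none of the target page's keywords.
    linking_words = set(re.findall(r"[a-z']+", linking_page_text.lower()))
    return not (keywords(target_page_text) & linking_words)

print(is_bad_link("guide to sourdough baking and hydration ratios",
                  "cheap watches replica handbags best prices"))   # True
print(is_bad_link("guide to sourdough baking and hydration ratios",
                  "I loved this sourdough baking guide"))          # False
```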
I tried doing something similar with AI the other day. My approach was to look at money flow instead, since in theory spammers only spam to make money. I basically downloaded an ad-blocker list and ran it against a page's source. That, along with a couple of other factors, was fed into many attempts at machine-learning fun. In the end it all failed. I learned that it's just impossible without a data set like Google's, so I went and built them into the process, and voila, it worked.
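A rough sketch of that "follow the money" feature: count how many ad or affiliate domains from a blocklist show up in a page's source. The three domains and the page snippet are placeholders; a real run would parse a full EasyList-style file.

```python
import re

blocklist = ["doubleclick.net", "adservice.example", "affiliate-tracker.example"]

def ad_feature_count(html: str) -> int:
    # Pull out every URL in the page source and count blocklist hits.
    urls = re.findall(r'''https?://[^\s"'<>]+''', html)
    return sum(any(domain in url for domain in blocklist) for url in urls)

page = '<a href="https://doubleclick.net/click?id=1">buy now</a> <img src="https://cdn.example/x.png">'
print(ad_feature_count(page))  # -> 1; fed into the classifier as one feature among several
```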
I have used maximum entropy classification for a quite similar task. It achieves better performance than Naive Bayes classifiers. But as the author remarked, the quality of the training set and the selection of features are very important aspects as well.
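Maximum entropy classification is what scikit-learn ships as LogisticRegression; a minimal sketch comparing it against multinomial Naive Bayes on bag-of-words features. The 20-newsgroups categories are just a stand-in for the actual task (and the dataset is downloaded on first run).

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

for name, clf in [("maxent", LogisticRegression(max_iter=1000)),
                  ("naive bayes", MultinomialNB())]:
    pipe = make_pipeline(CountVectorizer(), clf)
    print(name, cross_val_score(pipe, data.data, data.target, cv=3).mean())
```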
The Bayesian filter might have worked. There's no reason to use only content as a feature; you can use all the features you want, regardless of which ML technique you apply. Bayesian poisoning is a real concern, though.
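A sketch of that point: stack word counts together with a couple of made-up non-content features (link count, ALL-CAPS subject) and feed the lot to a multinomial NB. The toy messages and feature choices are assumptions, not anyone's production setup.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["FREE PILLS click http://x http://y", "lunch at noon?",
            "WIN CASH http://z", "draft attached, comments welcome"]
labels = [1, 0, 1, 0]
extra = np.array([[2, 1], [0, 0], [1, 1], [0, 0]])  # [link count, all-caps subject]

vec = CountVectorizer()
X = hstack([vec.fit_transform(messages), csr_matrix(extra)])  # content + non-content

clf = MultinomialNB().fit(X, labels)
test = hstack([vec.transform(["CLICK for FREE cash http://q"]), csr_matrix([[1, 1]])])
print(clf.predict(test))  # should lean toward [1], i.e. spam
```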