Machine Learning and Link Spam: My Brush With Insanity

116 points by a5seo about 12 years ago

9 comments

tlarkworthy about 12 years ago
Oh my yes! Machine learning is some of the hardest programming there is. You only ever get an indirect measure of whether it is working correctly or not. It's hard to debug. My algorithm gets it right 80% of the time; have I made an implementation mistake? Who knows?

My general strategy is to invest in training set curation and evaluation. I also use quick scatter plots to check that *I* can separate the training sets into classes easily. If it's not easy to do by eye, then the machine is not magic and probably can't either. If I can't, then it's time to rethink the representation.

The author correctly underlines the importance of the training set, but it is equally critical to have the right representation (the features). If you project your data into the right space, then pretty much any ML algorithm will be able to learn on it, i.e. it's more about what data you put in than about the processor. k-means and decision trees FTW.

EDIT: Oh, and maybe very relevant is the importance of data normalization. Naive Bayes classifiers require features to be conditionally independent, so you have to PCA or ICA your data first (or both), otherwise features get "counted" twice, e.g. every wedding-related word counting equally toward spam categorization. PCA realizes which variables are highly correlated and projects them into a common measure of "weddingness". Very easy with sklearn's preprocessing.Scaler() with whitening turned on.
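(A minimal sketch of the decorrelation idea described in this comment, assuming scikit-learn; the data and "weddingness" features are invented for illustration, and current sklearn spells the pieces StandardScaler and PCA(whiten=True) rather than the older Scaler.)

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic example: correlated "wedding-related" word counts (illustrative only).
rng = np.random.default_rng(0)
n = 1000
weddingness = rng.poisson(3, size=n)          # latent topic strength
X = np.column_stack([
    weddingness + rng.poisson(1, size=n),     # count of "wedding"
    weddingness + rng.poisson(1, size=n),     # count of "bride" (highly correlated)
    rng.poisson(2, size=n),                   # unrelated word count
])
y = (weddingness > 4).astype(int)             # 1 = spam, 0 = ham

# Scale, then PCA with whitening so the correlated features collapse into
# one decorrelated "weddingness" component before Naive Bayes sees them.
clf = make_pipeline(StandardScaler(), PCA(whiten=True), GaussianNB())
clf.fit(X, y)
print(clf.score(X, y))
```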
dvt about 12 years ago
I've wanted to build this for a while; I think an SVM-based spam solution could be amazing. Obviously, like the article mentions, when trying to categorize spam, a purely Bayesian approach is not great -- and neither is an ANN (although, with a large enough pool of hidden layers, it can get pretty decent). I think the issue lies in the problem set: spam cannot be treated like a linearly separable model.

There are papers [1][2] that outline possible benefits of SVM-based spam filtering. Unfortunately, SVMs are still in their infancy and not many people know how to implement and use them. I do think they are the future, however.

[1] http://trec.nist.gov/pubs/trec15/papers/hit.spam.final.final.pdf

[2] http://classes.soe.ucsc.edu/cmps290c/Spring12/lect/14/00788645-SVMspam.pdf
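(A bare-bones sketch of what an SVM-based spam filter looks like, assuming scikit-learn; the tiny corpus and labels are made up for illustration, and a real system would use a much larger dataset and possibly a non-linear kernel.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative corpus (invented); a real spam corpus would be far larger.
docs = [
    "cheap pills buy now limited offer",
    "win money fast click here",
    "meeting notes for tomorrow's standup",
    "here is the draft of the quarterly report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# TF-IDF features + a linear-kernel SVM; kernels like RBF are one way to
# handle data that is not linearly separable.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["free money offer click now"]))
```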
ZeroCoin about 12 years ago
The problem now is that (imho) 99% of the links posted on the internet are spam.

Unless you have a baseline of "what was here first" and "exactly when every website went live with what links" like Google does (because they have been indexing websites since the dawn of time as far as the internet and linking is concerned. Heck, there wasn't even backlink spamming prior to Google, because Google was the first search engine to rank by number of backlinks!), you're going to have a really tough time determining what is spam and what isn't.

Which 1% do you decide to focus in on?
btw0 about 12 years ago
I've built an anti-spam system for Delicious.com using a Naive Bayes classifier with a really huge feature database (think tens of millions), mostly tokens from different parts of the page. Those features are given different weights, which contribute to the final probability aggregation. The result was similar to what the OP achieved: around 80% accuracy. The work was really interesting and satisfying.
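(A rough sketch of that kind of setup, assuming scikit-learn; the per-field weights, tokens, and pages are invented for illustration and are not how the Delicious system actually worked.)

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy pages split into fields; a real system would tokenize title, body,
# anchor text, URL, etc. separately and weight each field differently.
titles = ["buy cheap pills", "python packaging guide"]
bodies = ["limited offer click now", "how to publish a package to pypi"]
labels = [1, 0]  # 1 = spam, 0 = ham

title_vec = CountVectorizer()
body_vec = CountVectorizer()
X_title = title_vec.fit_transform(titles)
X_body = body_vec.fit_transform(bodies)

# Up-weight title tokens relative to body tokens before the final
# probability aggregation in the Naive Bayes model.
X = hstack([X_title * 3.0, X_body])
clf = MultinomialNB().fit(X, labels)

X_new = hstack([title_vec.transform(["cheap pills now"]) * 3.0,
                body_vec.transform(["click this offer"])])
print(clf.predict_proba(X_new))
```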
a_p about 12 years ago
I'm surprised this post doesn't mention Markov chains. The author seems to think that finding and implementing a grammar quality checker will help stop spam. Aside from providing endless hours of entertainment (viz. DissociatedPress), Markov chains are abused by spammers to generate grammatically correct nonsense. You can easily add meaning to the "nonsense" by adding formatting to certain words to carry a secondary message. Does anyone know of a way to stop this?
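(For context, a minimal sketch of the word-level Markov chain generation being described; the toy corpus is invented, and spammers would train on large volumes of legitimate text.)

```python
import random
from collections import defaultdict

# Toy training text (invented for illustration).
corpus = ("the quick brown fox jumps over the lazy dog "
          "the lazy dog sleeps while the quick fox runs").split()

# Build a first-order word-level Markov chain: word -> possible next words.
chain = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    chain[current].append(nxt)

# Generate plausible-looking but meaningless text by a random walk.
random.seed(0)
word = "the"
output = [word]
for _ in range(12):
    followers = chain.get(word)
    if not followers:
        break
    word = random.choice(followers)
    output.append(word)
print(" ".join(output))
```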
drakaal about 12 years ago
This is the really hard way.

And it is going to fail A LOT.

Do this instead:

1. Contact a company that has a search engine and therefore access to all your links (http://samuru.com springs to mind).

2. Do keyword extraction on those pages. Assume that anything that doesn't have any of the keywords of the page being linked to is a Bad link.

3. For the ones that remain, Google the keywords you extracted (like 10 of the words). If the linking page doesn't appear in the top 50 results, it is probably a Bad Neighbor according to Google.

This method doesn't require NLTK or grammar checking. You can do it yourself, and you are using Google to tell you if the site is on the Bad Neighbor list, so you don't have to guess.
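(A rough sketch of the keyword-overlap check in step 2, assuming simple word-frequency extraction; the stopword list, helper names, and example texts are invented for illustration and are not part of the original method.)

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

def top_keywords(text, n=10):
    """Crude keyword extraction: most frequent non-stopword tokens."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS and len(w) > 2]
    return {w for w, _ in Counter(words).most_common(n)}

def looks_like_bad_link(linking_page_text, target_page_text):
    """Step 2: a linking page sharing none of the target's keywords is suspect."""
    target_keywords = top_keywords(target_page_text)
    linking_words = set(re.findall(r"[a-z]+", linking_page_text.lower()))
    return not (target_keywords & linking_words)

# Toy example (invented text):
target = "machine learning spam detection classifier training features"
good_link = "a post about spam detection and machine learning classifiers"
bad_link = "cheap watches replica handbags best prices buy now"
print(looks_like_bad_link(good_link, target))  # False
print(looks_like_bad_link(bad_link, target))   # True
```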
ZirconCode about 12 years ago
I tried doing something similar with AI the other day. My approach was looking at money flow instead, since in theory spammers only spam to make money. I basically downloaded an ad-blocker list and ran it against a page's source. That, along with a couple of other factors, was fed into many attempts at machine-learning fun. In the end, it all failed. I learned that it's just impossible without a data set like Google's, so I went and built them into the process, and voila, it worked.
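(A small sketch of the ad-blocker-list check described here; the domain list and page HTML are invented stand-ins, since the comment doesn't say which filter list was used.)

```python
import re

# Toy stand-ins: a tiny ad-blocker domain list and a page's HTML source.
# A real run would use a downloaded filter list and fetched page source.
ad_domains = {"ads.example-network.com", "tracker.adsite.net"}
page_html = """
<html><body>
  <script src="https://ads.example-network.com/serve.js"></script>
  <a href="https://cheap-pills.example/buy">buy now</a>
</body></html>
"""

# Simple "money flow" signal: count references to known ad-network domains.
domains_on_page = set(re.findall(r"https?://([^/\"'\s>]+)", page_html))
ad_hits = domains_on_page & ad_domains
print(f"{len(ad_hits)} ad-network domain(s) referenced")
```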
JacobiX about 12 years ago
I have used maximum entropy classification for a quite similar task. It achieves better performance than Naive Bayes classifiers. But as the author remarked, the quality of the training set and the selection of features are very important aspects too.
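(Maximum entropy classification is equivalent to multinomial logistic regression, so a minimal sketch using scikit-learn might look like this; the corpus and labels are invented for illustration.)

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus for illustration only.
docs = [
    "free money click here now",
    "cheap meds limited time offer",
    "agenda for the project kickoff meeting",
    "minutes from yesterday's design review",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Logistic regression over bag-of-words counts, i.e. a maxent classifier.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(docs, labels)
print(clf.predict_proba(["free offer for the meeting"]))
```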
zaptheimpaler about 12 years ago
The Bayesian filter might have worked. There's no reason to use only content as a feature - you can use all the features you want, regardless of which ML technique you apply. Bayesian poisoning is a real concern, though.