Help Reddit build a recommender

153 points by ahalan about 13 years ago

12 comments

zach about 13 years ago

Some may remember that Reddit used to have an item recommender a long time ago, back in its first year or so. It was a Bayesian classifier that, since it needed a bunch of input, only worked for the most hardcore members, who had already seen almost all of the recommendations!

This was originally the "hard problem" at the center of Reddit.

Let me explain what I mean by that. There used to be a quaint notion that to be a respectable tech startup, you had to have a "hard problem" (technologically speaking) at your core, which you had an innovative "secret sauce" solution for, preferably one you were patenting. After all, if not, then someone can just copy you and squash you like a bug, right?

Since then, YC's insistent focus on making something people want, Eric Ries' lean startup gospel and many entrepreneurs' own experiences have thankfully gone a long way to convince people (most importantly SV investors) that focusing on a "hard problem" is not only unnecessary, but may end up being a fatal distraction.

This is a pretty good example of how the "hard problem" can turn out to be completely irrelevant. Once it was clear that the recommendation engine wasn't a growth vector, the Reddit team seemed to drop it out of sheer pragmatism. They just needed to keep the site running.

I can't recall many who cared or even noticed that the "recommended" tab was gone. But from that point on, Reddit was more free to become not just a quirky "personalized news" startup, but what it has aspired to since: the front page of the internet. And only now, just now, do a good chunk of the millions of users think a recommender might be nice.

It's the startup version of "you aren't gonna need it": if it doesn't drive growth, push it aside.


jstepien about 13 years ago

During the previous semester I spent some time building a recommender using this data as a project for a data mining class. It turned out to be far more challenging than I had initially anticipated.

I used methods known as collaborative filtering, whose goal is to estimate how a given user would rate a given item based on the known preferences of other users with similar interests. The initial scope included a naïve Bayesian classifier (NBC) and a technique called Slope One (S1) [1]. The latter is particularly interesting because, according to its authors, it produces a very good estimate in a very short time using only a very simple linear model. The preprocessing is expensive in both time and space, though, as it requires you to build a matrix of deviations between rated items.

After reducing the data set to a single subreddit and filtering out users who weren't avid voters, I ran the algorithms, and after some tuning I was very content to see promising ROC curves and decent AUC values. Models built around NBC and S1 achieved comparable results on metrics such as precision, recall and F-measure.

When I went to discuss the results with the professor teaching the class, I heard: "That's indeed promising, but how about comparing those results with a really naïve model which would just take an average of existing votes by a given user?" Guess what: the model built using a single call to the avg function was nearly as good as the NBC and S1 models.

Now I understand why the guys from Reddit are looking for external help with the recommender. It's a way less obvious task than it might seem to be.

[1] http://lemire.me/fr/documents/publications/lemiremaclachlan_sdm05.pdf

Edit: s/machine learning/data mining/
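For readers who have not met Slope One, here is a minimal sketch of the weighted variant next to the per-user average baseline the professor suggested. It is not the code from the class project: the toy `votes` data, the function names and the 0/1 vote encoding are all assumptions made for illustration.

```python
from collections import defaultdict


def build_deviations(ratings):
    """Precompute the average rating difference and co-rating count per item pair.

    This is the expensive preprocessing step: the result holds one entry
    for every ordered pair of items that some user rated together.
    """
    diff_sum = defaultdict(float)
    freq = defaultdict(int)
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    diff_sum[(i, j)] += r_i - r_j
                    freq[(i, j)] += 1
    dev = {pair: total / freq[pair] for pair, total in diff_sum.items()}
    return dev, freq


def slope_one_predict(ratings, dev, freq, user, item):
    """Weighted Slope One estimate; None if the user shares no co-rated item."""
    num = 0.0
    den = 0
    for j, r_j in ratings[user].items():
        if (item, j) in freq:
            num += (dev[(item, j)] + r_j) * freq[(item, j)]
            den += freq[(item, j)]
    return num / den if den else None


def user_average(ratings, user):
    """The 'really naive' baseline: mean of the user's existing votes."""
    values = ratings[user].values()
    return sum(values) / len(values)


# Hypothetical votes: 1.0 = upvote, 0.0 = downvote.
votes = {
    "alice": {"a": 1.0, "b": 0.0, "c": 1.0},
    "bob": {"a": 1.0, "b": 0.0},
    "carol": {"b": 1.0, "c": 0.0},
}
dev, freq = build_deviations(votes)
slope_one_estimate = slope_one_predict(votes, dev, freq, "bob", "c")
baseline_estimate = user_average(votes, "bob")
```

Evaluating both estimates against held-out votes with the same metric is the comparison described above.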


espeed about 13 years ago

The Neo4j User Group wants to help with this (https://groups.google.com/d/topic/neo4j/rkhjlQx-bfo/discussion).

Gremlin (https://github.com/tinkerpop/gremlin/wiki) works great for real-time recommendations.

See "A Graph-Based Movie Recommender Engine" by Gremlin's creator, Marko Rodriguez (http://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/).

mikeklaas about 13 years ago

For anyone who's trying this, I recommend basing your effort on factor models (i.e., the thing that won the Netflix Prize). It works very well for us at Zite.

(Content models are the other, probably less interesting, 50% of the solution.)
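As a rough illustration of what a factor model looks like, here is a minimal sketch of matrix factorization trained with stochastic gradient descent. It is one common form of the idea, not Zite's system or the Netflix Prize solution; the data, hyperparameters and function names are assumptions.

```python
import random


def predict(p, q, user, item):
    """Predicted rating is the dot product of user and item factor vectors."""
    return sum(a * b for a, b in zip(p[user], q[item]))


def rmse(p, q, ratings):
    """Root mean squared error over (user, item, rating) triples."""
    total = sum((r - predict(p, q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


def train_factors(ratings, n_factors=2, epochs=500, lr=0.01, reg=0.02, seed=0):
    """Learn latent factors by SGD on squared error with L2 regularization."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    p = {u: [rng.uniform(0.1, 1.0) for _ in range(n_factors)] for u in users}
    q = {i: [rng.uniform(0.1, 1.0) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(p, q, u, i)
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q


# Hypothetical (user, item, rating) triples on a 1-5 scale.
data = [
    ("u1", "a", 5.0), ("u1", "b", 4.0), ("u1", "c", 1.0),
    ("u2", "a", 4.0), ("u2", "b", 5.0),
    ("u3", "b", 1.0), ("u3", "c", 5.0),
    ("u4", "a", 1.0), ("u4", "c", 4.0),
]
p0, q0 = train_factors(data, epochs=0)   # untrained factors, same seed
p, q = train_factors(data)
error_before = rmse(p0, q0, data)
error_after = rmse(p, q, data)
```

A real evaluation would measure error on held-out ratings rather than on the training triples used here.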

krelian about 13 years ago

This is a bit old, no? Anyway, I don't need a recommender, I need a better way to let me affect the weight different subs have on the homepage. I need a way to group different low-traffic subreddits together so that I won't miss their content among the high-traffic ones.

Reddit's old interface doesn't work anymore now that there are so many subs. The fact that there have been so few interface improvements in the last couple of years is pretty sad. I can't imagine browsing the site without RES.

The way things work now only helps to magnify the lower-quality trend because the homepage gives undue weight to content from popular subs.


wrath about 13 years ago

This is a pretty open-ended problem!

How do you measure success? After I create my algorithm, how do I know that I'm close to what Reddit wants? Without answers to these questions, IMO, this is an exercise in futility. I'm not close enough to the project, but I've written my fair share of classifiers and clustering engines, and for any machine learning problem there needs to be a way to measure success. My point of view on a great result is different from Reddit's for sure.

thedark about 13 years ago

This is exactly the sort of thing a properly implemented tagging system would have solved. Along with their notorious search problems. Along with the difficulty in finding subreddits. Along with discovering old content. 6 years later I maintain this as a mistake.


PaulHoule about 13 years ago

Collaborative filtering is a boring problem and doesn't get to the heart of what's wrong with Reddit, Hacker News, and such.

For one thing, many good stories languish on the "new" page and never get enough votes to get a fair shake. Collaborative filtering doesn't help with this; if anything, it makes it worse.

Last night I made a crude boomerang by gluing two rulers together. This morning it had set, and my son pressured me to try throwing it before I'd even finished my breakfast. Right when it started to curve, it hit a telephone pole and broke at the glue joint.

When I see many of the things people want to do on Reddit, my first impression is that they will wind up like that. For instance, LSI is one of those things that does not work so well in real life... They still seem to be teaching kids about it, but not that you get results almost as good doing dimensional reduction with a random basis set.

If you've got some semantic analysis and predictive models, you can make an automated system that picks quality relevant content out of the "new" queue, and because you can use smart feature selection you don't need to wrangle as much data -- training is orders of magnitude faster and you don't need to futz around with Hadoop.

lars about 13 years ago

I don't think you need voting data. Rather, answer the question: "what subreddit is similar to this particular subreddit?" Then you use the degree of overlap in subscribers as a distance measure between subreddits. Use a tf*idf-like approach, so popular subreddits are weighted less.

Then the similarity of r/programming to r/coding would be based on two numbers:

<pre><code>b = number of people subscribed to both r/coding and r/programming
n = number of people subscribed to r/coding
similarity = b/n</code></pre>
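The b/n measure above can be sketched in a few lines of Python. The subscriber sets, the `subscribers` mapping and the function names are hypothetical; the point is only that dividing by the candidate's own subscriber count pushes very popular subreddits down the ranking.

```python
def similarity(subscribers, target, candidate):
    """b / n: share of the candidate's subscribers who also subscribe to the target."""
    b = len(subscribers[target] & subscribers[candidate])
    n = len(subscribers[candidate])
    return b / n if n else 0.0


def most_similar(subscribers, target, top_k=3):
    """Rank every other subreddit by its similarity to the target."""
    scores = [
        (similarity(subscribers, target, other), other)
        for other in subscribers
        if other != target
    ]
    return sorted(scores, reverse=True)[:top_k]


# Hypothetical subscriber sets keyed by subreddit name.
subscribers = {
    "r/programming": {"u1", "u2", "u3", "u4"},
    "r/coding": {"u1", "u2"},
    "r/pics": {"u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8"},
}
ranked = most_similar(subscribers, "r/programming")
```

In this toy data r/pics shares more subscribers with r/programming in absolute terms, but r/coding ranks first because all of its subscribers overlap. Note that the measure is asymmetric: swapping target and candidate generally gives a different number.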


markkat about 13 years ago

I'm not sure if it's a recommendation engine they need, but they do need a better way to find subreddits. IMO some sort of a map might work better than a rec engine. Or even just a quick way to see what subreddits another user subscribes to.

mumrah about 13 years ago

I think that, given the volume of users on Reddit and the volume of content they interact with, any of the various collaborative filtering techniques would work well at this point.

You could take it a step further and incorporate more than explicit up/down vote features, such as "clicked", "commented", "saved", etc.

Then incorporate some business rules that filter recommendations by subreddits, boost results by time, and now you have a decent recommender.

Easier said than done of course.
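A minimal sketch of those two layers, assuming a collaborative filtering step has already produced a score per candidate. The signal weights, the half-life and the field names (`id`, `subreddit`, `score`, `posted_hours`) are invented for illustration and would need tuning against real data.

```python
# Hypothetical weights for turning mixed feedback into one preference value.
SIGNAL_WEIGHTS = {
    "upvote": 1.0,
    "downvote": -1.0,
    "clicked": 0.2,
    "commented": 0.5,
    "saved": 0.8,
}


def interaction_score(events):
    """Collapse a user's events on one item into a single implicit rating."""
    return sum(SIGNAL_WEIGHTS.get(event, 0.0) for event in events)


def rerank(candidates, allowed_subreddits, now_hours, half_life_hours=12.0):
    """Apply business rules: filter by subreddit, then decay scores by age."""
    ranked = []
    for c in candidates:
        if c["subreddit"] not in allowed_subreddits:
            continue
        age = now_hours - c["posted_hours"]
        decay = 0.5 ** (age / half_life_hours)
        ranked.append((c["score"] * decay, c["id"]))
    return sorted(ranked, reverse=True)


# Hypothetical candidates with scores from an upstream recommender.
candidates = [
    {"id": "old", "subreddit": "r/programming", "score": 1.0, "posted_hours": 0.0},
    {"id": "new", "subreddit": "r/programming", "score": 0.6, "posted_hours": 24.0},
    {"id": "blocked", "subreddit": "r/pics", "score": 0.9, "posted_hours": 24.0},
]
result = rerank(candidates, {"r/programming"}, now_hours=24.0)
```

The values from `interaction_score` would feed the collaborative filtering model in place of raw votes, while `rerank` runs on its output.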