
Improving the Hacker News Ranking Algorithm

122 points by manx over 3 years ago

18 comments

slx26 over 3 years ago
I'd be happy to see some experiments with ranking algorithms, maybe on a separate site, or as options within HN as someone else suggested. There are downsides to both, though: potentially low exposure for the first, technical complexity for the second.

What worries me is the definition of quality you use. We look for submissions that we find valuable to us, not necessarily high quality. Interests vary, and we all get value from different things. Quality may be strongly correlated with value, but it's not quite the same. And here comes the big issue: maybe more than half of the value we derive from HN comes from the comments. I feel we often upvote submissions for sheer relevance, so that we can have valuable discussions about a relevant topic. I don't think the given metrics capture this. I like the analysis and the proposal, but it's easy to see that a very important perspective is missing.
ergl over 3 years ago
> [I]magine two submissions with the same number of upvotes and a different number of views. The article with more views probably has a lower quality than the one with fewer views, because more people have viewed it without giving an upvote. So the ratio of upvotes and views is a signal of quality.

I believe this creates a bias against longer articles. If a submission links to a longer piece, a user could take a long time to come back to HN to upvote the submission.

That's not to say this would be any worse than the current algorithm. By definition, a time-ranked front page and a moving discussion will always favor shorter articles, or, if the content is long, produce plenty of comments that are only superficially related to the article.

As a submission ages, its rank wanes and discussion grows sparser: a piece that takes time to read, understand, and digest will probably perform worse than a short one, since the discussion and votes will take place further in the future.
Comment #28401792 not loaded
Comment #28430680 not loaded
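The upvote-per-view ratio quoted above can be sketched in a few lines; the function name and the zero-view handling are my own choices, not the article's:

```python
def quality_estimate(upvotes: int, views: int) -> float:
    """Estimated quality as the fraction of viewers who upvoted.

    With equal upvotes, the submission with more views scores lower,
    which is exactly the bias against long articles described above:
    readers of a long piece may not return to vote until much later.
    """
    if views == 0:
        return 0.0  # no exposure yet; the article doesn't specify this case
    return upvotes / views


# Same 10 upvotes, but the widely viewed submission is ranked lower:
assert quality_estimate(10, 100) > quality_estimate(10, 1000)
```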
dzink over 3 years ago
Excellent analysis, and I think the proposed algorithm would work well against clickbait titles. HN could trial several different algorithms for a week each and watch the metrics. Thinking out loud:

Part of HN's charm is that niche content can surface to the top, and many here are experts in their field, contributing great insights unavailable in normal media (example: pilots discussing possible causes after an airplane crash). That means the rest of us get to learn something new with perspectives from the best, and many newbies become interested in more. That's what top university classrooms feel like. I'm not sure what metric would quantify that quality (raw comment count as a metric encourages rage posting; maybe the number of comments with high upvotes?). The current system, although gameable and full of false negatives, also has entropy built in: randomness that is more likely to bring a wider range of content to the front page. That helps with genetic algorithms, so it might be a benefit, but one that is missed if not quantified. If gaming the system counts, the most persistent also win some more.

Another option to consider is duration of stay on articles. From testing, it seems Google factors duration of stay on content into its search ranking as well, which skews rankings toward longer pages. Those also tend to be ad-supported and likely within Google's own advertising network. Within HN, more comments increase duration of stay, so that would likely favor provocative content and may not be a good idea, but testing would surface that and more.
pyentropy over 3 years ago
Have you tested the behaviour of your algorithm on pagination and the dynamics of link position over a day?

I assume the good thing about the snowball effect is that "making it to the front page" gives you guaranteed and pronounced exposure for a day or two, even though it's not maximizing the number of good links shown.
Comment #28398124 not loaded
Comment #28396991 not loaded
jillesvangurp over 3 years ago
Sounds good to me. I've had very little engagement (good or bad) on things I've submitted in the past that I thought should have been interesting to at least a few people. It seems a lot of content falls through simply because it isn't timed right to get somebody to push the upvote button. I had one upvote on a link I submitted last weekend. I suspect nobody actually saw the link or clicked on it, and that this was not based on any merit. I've also had links submitted with no action, only to see the same link submitted by somebody else make it to the front page much later.

Duplicate links are actually interesting, since they are so easy to detect (identical link). Why not simply aggregate metrics for them? The important thing is the link making it to the front page without creating a lot of duplicates. Somebody submitting a duplicate would effectively become an upvote for the already-submitted link. Views and duplicate submissions are possibly more significant signals than upvotes.

In search, precision and recall are the two metrics people use for judging search quality. It's important to realize that Hacker News is effectively a ranking algorithm and therefore a search engine, even though it delegates actual search to Algolia. It sacrifices recall for precision: everything on the front page should be relatively high quality, but at the price of potentially high-quality things never reaching it (recall).

Of course, with only 30 slots on the front page, there's only so much that can be on it, especially if you consider that many users only drop by once or twice a day. So those slots stay occupied for quite a long time as well, days in some cases. The choice of what is right is highly subjective (i.e., the moderators decide) and biased toward the intentions of the site, and that's intentional. But that doesn't mean it can't be improved.
max_ over 3 years ago
How about letting HN users choose which algorithm they want, the same way they choose their header color?
Comment #28402573 not loaded
Comment #28402312 not loaded
jsnell over 3 years ago
> The low scores are a bit surprising because all submissions got enough votes to make it to the front-page.

The entire premise of this section is incorrect. There is no such thing as "enough votes to make it to the front page". There's no single threshold, nor is the value of a single vote constant. You're going to have a lot of other effects, like specific domains carrying score penalties, votes being ignored because they appear fraudulent / non-organic, or votes arriving too diffusely over a long time period. (Duplicates of the same URL count as a vote for something like 12 hours after submission, so it's quite easy for a link to collect a long tail of votes even after dropping off the first page of /new.)

I would bet that most of what they claimed made it to the front page never did. There used to be some HN stat-tracking sites around that had the full ranking history of submissions. Joining against that data would be a lot more credible.

> To achieve this goal, the new-page should expose every submission to a certain amount of views, to estimate its eligibility for the front-page.

That is obviously unworkable. /new is a slush pile; 90% of the submissions do not deserve even a single view based on the title. The proposal ensures that people visiting /new will only see the obvious garbage that nobody else has clicked on either.
Comment #28403157 not loaded
pvg over 3 years ago
Are 'views' defined anywhere in this writeup? I couldn't quite figure out what the added ranking parameter 'views' exactly is and how it's measured.
Comment #28401174 not loaded
llampx over 3 years ago
I noticed later submissions often had higher scores. Could this be a function of HN getting more popular over time?
Comment #28401198 not loaded
potamic over 3 years ago
I have been thinking it would be interesting to do the following experiment. Pick two random new posts and show them directly on the home page for a fixed amount of time. Feed the data from those posts (views, votes, content, comments) into some neural network that can magically learn what kinds of posts generate the most interest.

Showing random new posts on the home page gives the experiment set equal viewership while removing the age bias. Many candidate posts will likely be low quality, but we can expect people to filter them out and participate less in them. High-quality posts will organically attract people's attention, and over time the algorithm can learn the factors that distinguish lower-quality from higher-quality posts.
Comment #28402063 not loaded
Comment #28402517 not loaded
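The experiment proposed above amounts to reserving a couple of front-page slots for uniformly sampled new posts. A minimal sketch, where the slot counts and function names are my assumptions:

```python
import random


def front_page_with_experiment(ranked, new_posts, n_slots=30, n_random=2, rng=None):
    """Fill most slots from the normal ranking, but inject a few randomly
    chosen new posts at random positions. Random choice and random placement
    remove the age/position bias, so their view and vote counts can be
    compared fairly and later fed to a learned model."""
    rng = rng or random.Random()
    injected = rng.sample(new_posts, min(n_random, len(new_posts)))
    page = list(ranked[: n_slots - len(injected)])
    for post in injected:
        page.insert(rng.randrange(len(page) + 1), post)
    return page
```

The model training itself is out of scope here; this only shows how the unbiased data collection could work.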
streamofdigits over 3 years ago
Does the assumption of a homogeneous pool of content and a corresponding homogeneous pool of readers affect this? Segmenting by topic and applying the algorithm to each pool might be necessary if one wants a "fair" assessment of quality.
Comment #28401544 not loaded
Comment #28401481 not loaded
krapp over 3 years ago
If you want the staff to see this, email them at the contact link below.
Comment #28401100 not loaded
yzan over 3 years ago
An obligatory link to an article by Evan Miller about ranking: https://www.evanmiller.org/how-not-to-sort-by-average-rating.html

However, to satisfy the following requirement from the article:

> The algorithm should not produce false negatives; the community should find all high-quality content.

it might be better to estimate the upper confidence bound, as in an Upper Confidence Bound bandit, rather than the lower confidence bound.
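Evan Miller's article ranks by the lower bound of the Wilson score interval on the true upvote rate; the commenter's point is that the upper bound, UCB-style, would instead give under-viewed items the benefit of the doubt. Both come from the same formula. A sketch (variable names and the no-data case are my own choices):

```python
import math


def wilson_bound(upvotes: int, views: int, z: float = 1.96, upper: bool = False) -> float:
    """Wilson score interval bound on the true upvote-per-view rate.

    upper=False: the conservative lower bound from Evan Miller's article.
    upper=True:  the optimistic bound, as in a UCB bandit, which avoids
                 false negatives on items that simply lack views so far.
    """
    if views == 0:
        return 1.0 if upper else 0.0  # no data: optimistic vs. pessimistic
    p = upvotes / views
    center = p + z * z / (2 * views)
    spread = z * math.sqrt(p * (1 - p) / views + z * z / (4 * views * views))
    return (center + spread if upper else center - spread) / (1 + z * z / views)
```

With 5 upvotes in 10 views the lower bound is about 0.24 and the upper about 0.76; ranking by the upper bound keeps a fresh item visible until more views tighten the interval.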
thrower123 over 3 years ago
One of these days, I am going to write a custom front end.

The biggest problem I have with the front page these days is that it gets clogged with junk science and low-effort news posts that produce entirely predictable flame wars that just aren't very interesting. There are about two dozen domains that account for 95% of such stories. I go through and flag/hide them now, and that works okay, but doing it automatically would be better.
yzan over 3 years ago
Also, the ranking algorithm should take into account the rank of the content when the upvote happened: 10 upvotes for content on the 5th page is much more impressive than 10 upvotes for content on the 1st page (keeping everything else constant).
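One way to make that concrete (purely my illustration, not anything the article or HN does) is to weight each upvote by the inverse of the estimated probability that a reader reaches the item's rank at all:

```python
def upvote_weight(rank: int, decay: float = 0.97) -> float:
    """Weight an upvote by how unlikely the voter was to even see the item.

    Assumes, for illustration only, that the probability of a reader
    scrolling to a given rank decays geometrically. A vote cast at
    rank 121 (page 5 at 30 items per page) then counts far more than
    one cast at rank 1, everything else held constant.
    """
    view_probability = decay ** (rank - 1)
    return 1.0 / view_probability
```

The exact decay would have to be fitted from real click-through data rather than picked by hand as here.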
throwamon over 3 years ago
I would love to test-drive this via a web app. Any plans for one?
Comment #28401730 not loaded
rukshn over 3 years ago
There are situations where some posts reach the front page hours after falling off the new page.

It has happened to me a couple of times as well, and I can't explain it.
Comment #28401510 not loaded
cyberge99 over 3 years ago
I typically browse via my RSS reader / feed, so even though I'm here daily, I never see the front page (I do upvote on comment pages, however).
Comment #28398318 not loaded