Data Mining Hacker News: Front vs. Back

183 points by equilibrium over 8 years ago

20 comments

Really nice work David, good job.

One suggestion: For each important conclusion, try to have at least one sentence that is understandable by a business exec.

For example, at first glance it looks like time of day may be significant, then you conclude:

"Because the p-value is greater than the alpha value, we fail to reject the null hypothesis that the two nominal categories are independent."

By adding after this something like "Therefore submitting articles at a certain time of day is not an effective strategy to achieve front page visibility.", your post gains accessibility.

This is not a nitpick. The idea is to make sure the full power of your analysis is felt across a broad section of readers. Even if you send to tech people, these things often find their way to a wider audience.

willvarfar over 8 years ago

Yeah these days it feels really random who ends up on the front page; there are just too many stories being submitted, and too few people filtering them :(

Back when I was blogging I crunched the HN stats and tried to draw conclusions: http://williamedwardscoder.tumblr.com/post/18839832580/reddit-vs-hacker-news-vs-twitter

minimaxir over 8 years ago

A couple of years ago, I did my own analysis of all Hacker News submissions (http://minimaxir.com/2014/02/hacking-hacker-news/) and also wrote a script around that time to get all the data (https://github.com/minimaxir/get-all-hacker-news-submissions-comments ; see also a modern dataset on Kaggle derived from it: https://www.kaggle.com/hacker-news/hacker-news-posts). I only looked at the number of points as a metric for quality, so front v. back with this approach is interesting. Given the good work in this post, I may take another look at the data myself.

This is a case where the sample size used may be problematic. "425 fronts against 570 corresponding backs" (n = 995), in the grand scheme of Hacker News, is not a lot, even if statistical analysis permits it (example: the by-hour Chi-Sq test, which barely hits the 5-per-cell assumption). Given the method of collection by scraping the front page directly, this is understandable, though.

However, that presents a problem. The front-page algorithm has changed in recent months, and I myself have had difficulty predicting what makes the front page and what doesn't (and what ends up making the front page hours after being submitted for no apparent reason). With relatively new features like the second-chance pool and explicit dupe marking, there is new quality control of the front page thanks to dang/sctb. That is another issue with looking at a small subset of HN data: it does not reflect the site as a whole, although looking at more recent data might be more useful for optimizing one's own posts.
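As a rough illustration of the 5-per-cell assumption mentioned above, here is a minimal Python sketch that computes the expected counts of a contingency table and checks whether every cell reaches 5. The row totals (425 fronts, 570 backs) come from the quoted figures; the three column buckets and the per-cell counts are hypothetical and are not the post's data, and the function names are illustrative only.

```python
from typing import List


def expected_counts(observed: List[List[int]]) -> List[List[float]]:
    """Expected cell counts under independence: row_total * col_total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed: List[List[int]]) -> float:
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def meets_five_per_cell(observed: List[List[int]], minimum: float = 5.0) -> bool:
    """True when every expected count is at least `minimum`."""
    return all(e >= minimum for row in expected_counts(observed) for e in row)


# Hypothetical counts: rows are fronts (425) and backs (570),
# columns are three made-up time buckets.
observed = [
    [30, 3, 392],
    [45, 7, 518],
]

print(expected_counts(observed))
print(chi_square_statistic(observed))
print(meets_five_per_cell(observed))
```

With a sparse bucket like the middle column here, the check fails, which is the usual signal to merge adjacent hours into wider buckets or collect more data before trusting the test.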

lewisjoe over 8 years ago

Great job! I've been working on a related tool that could be useful as well.

http://hnlive.tk/static/index.html is a "live" HN activity meter.

I wrote it for myself. Before posting to HN, I use it to decide if the activity on the site is high enough. Right now the graph says the current time has the highest activity spike in the past 24 hours.

It's far from done. I've yet to plot answers to a few more common questions, backed by realtime data. Like, say:

+ Which weekday had the highest activity last week?
+ Which weekday usually has high activity?
+ What time slot last week had the highest activity spike?

tonylemesmer over 8 years ago

I generally only visit the "new" stories page if I don't find many interesting front page items. So I wonder if there is a correlation there. Amount of browsing time available vs. promotion of new items to front page.

gus_massa over 8 years ago

I like the analysis, but I wonder if the criteria are weak enough to detect the dupes from medium.com, because Medium adds some tracking crap to the URL that confuses the dupe detector of HN. For example see: https://hn.algolia.com/?query=I%20Peeked%20into%20My%20Node_Modules%20Directory%20and%20You%20Wont%20Believe%20What%20Happened%20Next&sort=byDate&dateRange=all&type=story&storyText=false&prefix=false&page=0 (this list doesn't include many dupes that were detected and marked).

A problem with this analysis is that it doesn't count the dupes that never had a sibling that got to the front page. Counting them would modify the distribution of some domains and submitters.
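One way to loosen the matching criteria, sketched in Python below, is to canonicalize URLs before comparing them. This is only an assumption about where the tracking data lives (the URL fragment and `utm_`-prefixed query parameters); it is not how HN's dupe detector or the post's analysis actually works, and the example URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters with these prefixes carry only tracking data.
TRACKING_PREFIXES = ("utm_",)


def canonical_url(url: str) -> str:
    """Drop the fragment and tracking parameters so near-identical URLs compare equal."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        (
            parts.scheme.lower(),
            parts.netloc.lower(),
            parts.path.rstrip("/"),
            urlencode(kept),
            "",  # fragment removed
        )
    )


# Hypothetical submissions of the same article.
a = canonical_url("https://medium.com/@someone/some-post-abc123#.x1y2z3")
b = canonical_url("https://medium.com/@someone/some-post-abc123?utm_source=hn")
print(a == b)
```

Grouping submissions by `canonical_url` rather than by the raw URL would fold such variants into one story before counting fronts and backs.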

iraldir over 8 years ago

While it's a very interesting analysis, it kinda reinforces the idea that it's down to luck. Sure, you can post on the weekend to increase your chances (though I don't understand that: given there is only so much room on the hot page, if everyone is more likely to reach the top page then no one is). I think it's just that people going to the "new" section tend to upvote links that are already upvoted. The only way to improve your odds is to artificially bump up the upvotes by asking friends from different parts of the world to upvote your article while it's in the "new" section (note that you cannot give them a link to your post directly, or their votes won't be taken into account).

keyle over 8 years ago

Impressive research. I don't really mind the repost if the article did in fact fall into oblivion when we should have paid attention. A gentle reminder for everyone to sometimes visit the 'new' section and upvote the interesting parts.

Maybe an AI data mining process could know what's interesting based on.... wait, no, that's a bad idea :)

jstanley over 8 years ago

Good investigation but would be nice to see more in the way of conclusions that can be drawn.

brador over 8 years ago

I wonder if you could analyze this data to extract moderation information (for example when mods changed, or when mod activity level changed). It would be interesting to identify data spikes, and try to understand why.

MeteorMarc over 8 years ago

I think it would be nice to add the time delay until the first upvote as a feature in the analysis. Whenever I check the "new" page, I tend to look at the items which already have an upvote.

chirau over 8 years ago

Monitored tag filters, like on https://lobste.rs/, would be great for HN. There is too much randomness on HN now.

gcr over 8 years ago

To what extent will the publication of this article change HN trends to make its conclusions invalid?

If everyone reads this article and then follows its recommendations, wouldn't HN posting strategy change?

overcast over 8 years ago

Side discussion: does anyone know what type of code highlighting library they are using for this? It looks like server-side processing that then outputs the HTML/CSS.

dredmorbius over 8 years ago

Interesting study, it suggests a few dynamics.

The weekday data show a high back-page rate for Tuesday, and a high front-page rate for weekend posts (Saturday/Sunday). This suggests to me that the total volume of posts, a statistic not presented (that I noticed), might have some bearing. Specifically, many PR firms and other seekers of publicity tend to target Tuesday morning for positive items, as these beat the Monday rush (and blues), but allow for time to process during the rest of the week. And professional submitters are going to be quiet on weekends. If I had to hazard a guess, I'd suggest that HN attracts a significant amount of direct or indirect PR blitzing. My thought is that PR pieces are, in general, less likely to be voted to the front page than organic content -- where PR includes low-quality blog, YouTube, marketing, and similar types of content.

The time-of-day analysis suggests something similar. Traffic begins to pick up at about 0400 system time, which is US/Pacific. That would be 7am East Coast (morning breakfast/commute) and about 10am in Europe, suggesting there's traffic arriving from those locations. There's also a pretty noticeable dip in backs ratio around the noon hour, plus or minus, and a slight increase in the early afternoon. Again, PR / SEO content might take a mid-day break within the US.

As for "new" page reconfigurations, a concern I've had is that as submissions increase, the time any given item stays on the page decreases -- to well under an hour at peak times. The odds of even a good item collecting upvotes are small.

An alternative presentation might be to randomly shard submissions such that each is present on the page for at least some period of time, for some fraction of HN users. A hash of UID (or some other arbitrary value) and shard assignment, weighted by the predicted voting on the item, would present each unvoted and low-voted submission to a small set of users, but over a longer period of time, while increased positive votes would expand the exposure category. The idea is that each piece has a more realistic opportunity for exposure. Flags would subtract from scores.

HN does a good job of (usually) promoting quality and interesting content. It does have a high false-negative rate, in not promoting good content, which is a problem. On the other hand, there are very real limits to how much content a person can handle in a day, and simply opening the firehose wider isn't a viable solution.

Based on counts of daily emails from Stephen Wolfram and Walt Mossberg, and The New York Times moderation desk volumes, I'm seeing ~150 - 300 emails, or <800 comment moderations, per day, as something of a pretty consistent upper bound to meaningful content interaction, and that 800 is a pretty low value of "meaningful" at about 36 seconds per item. HN's front page with 30 solid articles is a pretty reasonable target for deeper material.
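The sharding idea above could be sketched roughly as follows in Python. Everything here is hypothetical: the shard count, the `base` and `per_vote` exposure parameters, and the function names are made up for illustration, and the net score (upvotes minus flags) stands in for the "predicted voting" weight. It is not a description of anything HN does.

```python
import hashlib

NUM_SHARDS = 100  # hypothetical: users are split into 100 buckets per item


def shard_of(user_id: str, item_id: int) -> int:
    """Stable pseudo-random bucket for a (user, item) pair, derived from a hash."""
    digest = hashlib.sha256(f"{user_id}:{item_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def exposure_shards(net_score: int, base: int = 5, per_vote: int = 10) -> int:
    """Number of buckets that see the item; grows with net score (upvotes minus flags)."""
    return min(NUM_SHARDS, base + per_vote * max(net_score, 0))


def is_visible(user_id: str, item_id: int, net_score: int) -> bool:
    """An unvoted item reaches about base% of users; each vote widens the audience."""
    return shard_of(user_id, item_id) < exposure_shards(net_score)


# Hypothetical usage: which of 1000 made-up users see a brand-new item?
audience = [u for u in range(1000) if is_visible(f"user{u}", 12683000, net_score=0)]
print(len(audience))
```

Because the bucket depends on both the user and the item, each submission gets its own small, stable audience rather than the same few users seeing every new item, and a user who could see an item at a low score still sees it as the score rises.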

appleflaxen over 8 years ago

The plot of differences would benefit from showing negative values, so that fronts > backs is positive (like Sat, Sun) and backs > fronts is negative (like Tuesday).

The way it's currently shown (magnitude or absolute value) takes a lot of cognitive load to parse, when it could be intuitive.
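A small Python sketch of the signed version suggested above: it keeps the sign of fronts minus backs and renders a crude text bar per weekday. The per-day counts are hypothetical, chosen only so that the weekend comes out positive and Tuesday negative as described; they are not the post's numbers.

```python
from typing import Dict


def signed_differences(fronts: Dict[str, int], backs: Dict[str, int]) -> Dict[str, int]:
    """Fronts minus backs per day: positive favors fronts, negative favors backs."""
    return {day: fronts[day] - backs[day] for day in fronts}


def render(diffs: Dict[str, int]) -> str:
    """One line per day, with '+' bars for positive values and '-' bars for negative ones."""
    lines = []
    for day, diff in diffs.items():
        bar = ("+" if diff > 0 else "-") * abs(diff)
        lines.append(f"{day} {diff:+4d} {bar}")
    return "\n".join(lines)


# Hypothetical counts, for illustration only.
fronts = {"Mon": 60, "Tue": 55, "Wed": 62, "Thu": 61, "Fri": 58, "Sat": 66, "Sun": 63}
backs = {"Mon": 62, "Tue": 75, "Wed": 64, "Thu": 60, "Fri": 59, "Sat": 54, "Sun": 52}

diffs = signed_differences(fronts, backs)
print(render(diffs))
```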

immixG over 8 years ago

This was very interesting - a bit confusing but helpful!

michaelknight over 8 years ago

Would like to see the next version in a week or two to see how your article affected the numbers.

(I'm sure from now on we will see a spike in posts on Tue at 6 and 11)

Exuma over 8 years ago

Very well put together post. Great work.

Raphmedia over 8 years ago

Why is that "posts" button moving so much? I can't focus on the text at all. I had to inspect the page and remove the animation in order to be able to focus.
