So that others can play with the data, here's a reverse-engineered version of the BigQuery query OP used to create the leaderboard:<p><pre><code> #standardSQL
SELECT
domain,
COUNT(*) AS num_posts,
perc_75,
AVG(score) AS avg_score,
(AVG(score) + 2*perc_75) * LOG(COUNT(*)) AS calc_score
FROM (
SELECT
-- Strip only a leading "www." (anchored, with the dot escaped)
REGEXP_REPLACE(NET.HOST(url), r'^www\.', '') AS domain,
score,
PERCENTILE_CONT(score, 0.75) OVER (PARTITION BY REGEXP_REPLACE(NET.HOST(url), r'^www\.', '')) AS perc_75
FROM
`bigquery-public-data.hacker_news.full`
WHERE
type = 'story'
AND url IS NOT NULL )
GROUP BY
domain,
perc_75
ORDER BY
calc_score DESC
</code></pre>
Top 10000 results: <a href="https://docs.google.com/spreadsheets/d/1Z9atmizTAPkgFiBte2eQiQgxEAiMyMB7Q99fzMzfIJs/edit?usp=sharing" rel="nofollow">https://docs.google.com/spreadsheets/d/1Z9atmizTAPkgFiBte2eQ...</a><p>(It's apparently not a perfect match: the leaderboard appears to require a minimum number of posts per domain, which should be added to its description. Without that requirement, <a href="https://news.ycombinator.com/from?site=pardonsnowden.org" rel="nofollow">https://news.ycombinator.com/from?site=pardonsnowden.org</a> would be #3.)
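For anyone who wants to poke at the formula without running BigQuery, here is a rough Python sketch of the same score. It is only meant to show why a domain with a handful of highly upvoted posts can land near the top when there is no cutoff. The function names, the `min_posts` parameter, and the sample scores are all made up for illustration. The percentile uses linear interpolation, which is how `PERCENTILE_CONT` works, and `LOG` with one argument is taken as the natural log.

```python
import math


def percentile_cont(values, q):
    """Linearly interpolated percentile, mirroring PERCENTILE_CONT."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def calc_score(scores):
    """(AVG(score) + 2 * perc_75) * LOG(COUNT(*)), with LOG as natural log."""
    avg = sum(scores) / len(scores)
    return (avg + 2 * percentile_cont(scores, 0.75)) * math.log(len(scores))


def leaderboard(scores_by_domain, min_posts=1):
    """Rank domains by calc_score, skipping those below a hypothetical cutoff."""
    ranked = [
        (domain, calc_score(scores))
        for domain, scores in scores_by_domain.items()
        if len(scores) >= min_posts
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up scores: one domain with 3 viral posts, one with many modest posts.
sample = {
    "fewposts.example": [2000, 1500, 1800],
    "manyposts.example": [10, 50, 120, 5, 80, 200, 30, 60, 15, 90],
}
print(leaderboard(sample))               # no cutoff: the 3-post domain ranks first
print(leaderboard(sample, min_posts=5))  # cutoff: the 3-post domain is dropped
```

The actual threshold the leaderboard uses is not known from the query, so `min_posts=5` is just a placeholder.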
Very cool, thanks for sharing! I did a somewhat similar analysis a while back [1], and I found that many of the top domains either had a YC affiliation or corresponded to extremely well-known companies or organizations. This made me interested in finding lesser known blogs that also produce high quality content. I tried to identify these by putting a limit on the number of unique users who had submitted content from each domain. My thinking here was that something like the GitHub blog would have submissions from many users, while smaller personal blogs would probably be mostly self-promoted. Using this approach, I was able to turn up some pretty interesting blogs that I had never heard of before.<p>I think it could really increase the usefulness of HN Domain Leaderboard if you added some additional filtering capabilities. Filtering based on the category would probably be pretty easy because you have that information there already, but perhaps also consider some measure of how broadly promoted each domain is. The time range option is already pretty cool, and I'll bet that a few more options would make it even more fun to play around with.<p>[1] - <a href="https://intoli.com/blog/pareto-optimal-blogs/" rel="nofollow">https://intoli.com/blog/pareto-optimal-blogs/</a>
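A minimal sketch of the unique-submitter filter described above, not necessarily how the linked post implemented it. The record layout, the `max_submitters` parameter, and the sample data are assumptions for illustration; the `by` key is assumed to hold the submitter's username, as the `by` column does in the public dataset.

```python
from collections import defaultdict


def domains_by_submitter_count(posts, max_submitters):
    """Return domains submitted by at most max_submitters distinct users."""
    submitters = defaultdict(set)
    for post in posts:
        submitters[post["domain"]].add(post["by"])
    return sorted(
        domain
        for domain, users in submitters.items()
        if len(users) <= max_submitters
    )


# Made-up posts: a widely shared blog and a mostly self-submitted one.
sample_posts = [
    {"domain": "bigcorp.example", "by": "alice"},
    {"domain": "bigcorp.example", "by": "bob"},
    {"domain": "bigcorp.example", "by": "carol"},
    {"domain": "smallblog.example", "by": "dave"},
    {"domain": "smallblog.example", "by": "dave"},
]
print(domains_by_submitter_count(sample_posts, max_submitters=1))
```

In practice this would be combined with a score threshold, since a low submitter count alone says nothing about quality.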
I'd really like to see the opposite of this: domains that have been flagged multiple times and have a high submissions-to-upvotes ratio so that I can filter them out.
It would be great if you could add top posts from each of these domains. I am really interested to see the top content I may have missed from a few of these domains.
This is a little out of date but may be of interest here. It's a visualization of the top 10,000 HN posts: <a href="https://www.sizzleanalytics.com/Boards/sizzle/Hacker-News-Top-Posts-All-Time/dfb2af8e-67fa-47a7-892c-435de6321378" rel="nofollow">https://www.sizzleanalytics.com/Boards/sizzle/Hacker-News-To...</a>
I would have thought bravenewgeek.com would make it onto the leaderboard since his posts [1] are typically high quality.<p>[1] <a href="https://news.ycombinator.com/from?site=bravenewgeek.com" rel="nofollow">https://news.ycombinator.com/from?site=bravenewgeek.com</a>
Ah! I was searching around for exactly this just a week ago and gave up. Could you add more granular date filters (past month, past week, etc.)?
Thanks for doing it!
Interesting that there are no news-related domains in the list. I wonder if that is due to the number of posts from those domains that never gain any traction.
Is the mean a valid statistic for this dataset?<p>I suspect that the score a link gets is highly variable and doesn't follow a known distribution, so a straight mean may not be a valid summary, or at the very least it will be very skewed.<p>That being said, cool idea, well executed.
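To make the skew concern concrete, here is a tiny Python illustration with made-up scores for a single domain: mostly low, plus one front-page hit. It only shows how far the mean and median can diverge on heavy-tailed data; it says nothing about the real distribution of HN scores.

```python
import statistics

# Made-up scores for one domain: mostly low, with one front-page hit.
scores = [1, 1, 2, 2, 3, 3, 4, 5, 6, 900]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

print(f"mean:   {mean_score}")
print(f"median: {median_score}")
```

Here the single outlier pulls the mean above every other post's score, while the median stays with the typical post. A percentile term like the 75th percentile in the query is less sensitive to one extreme value than the mean is.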
Interesting that so many of the top sites are "individual". I always thought that self-promotion was shunned on places like HN, but I guess if you do it in the "right" way, it can be a successful tactic.