What gets to the front page of Hacker News? A data project

6 points by itunpredictable, almost 2 years ago

1 comment

dredmorbius, almost 2 years ago
So, oddly enough, I've also been looking at HN front-page characteristics, based on the same corpus (the "past" page links). And that whole section on caveats over what that archive represents is something I could have written... The front page, both in its dynamic and archived forms, is strongly subject to many influences in complex ways.

A few tips:

- It's possible to crawl the page using wget, given a reasonable delay. The full collection from 2007 to present (I'd done my first crawl in late May of this year) took a couple of days. Updates to that happen in seconds.

- I break down data by date, story position (e.g., rank 1--30), submitted site (if present), points (votes), comments, and submitter, as well as title.

- I'm working on classifying titles. The original question prompting my analysis was what US states get the most love from HN (NY, CA, WA*, TX, and CO are the top 5). I'd expanded that to US and globally-significant cities, and been doing some tuple-based ngram analysis, though that gets pretty hairy.

For 2022 (most recent complete year), the top 40 submitted front-page sites are:

  2022:  Distinct sites: 6446

  Site                  Stories   Points  (   mean )  Comments  (   mean )
  --------------------  -------  -------  ----------  --------  ----------
  n/a                       432   167275  ( 386.32 )    125304  ( 289.39 )
  youtube.com               105    27243  ( 257.01 )     12489  ( 117.82 )
  nature.com                 80    17694  ( 218.44 )     11716  ( 144.64 )
  wikipedia.org              68    12258  ( 177.65 )      5855  (  84.86 )
  nytimes.com                67    21190  ( 311.62 )     21765  ( 320.07 )
  arstechnica.com            63    18319  ( 286.23 )     12057  ( 188.39 )
  ieee.org                   53     9432  ( 174.67 )      5933  ( 109.87 )
  reuters.com                53    28360  ( 525.19 )     29033  ( 537.65 )
  theguardian.com            49    12228  ( 244.56 )      8677  ( 173.54 )
  quantamagazine.org         48    11293  ( 230.47 )      5519  ( 112.63 )
  science.org                47    12485  ( 260.10 )      7655  ( 159.48 )
  economist.com              46    12504  ( 266.04 )     17324  ( 368.60 )
  bloomberg.com              43    20037  ( 455.39 )     20630  ( 468.86 )
  lwn.net                    43    10566  ( 240.14 )      5912  ( 134.36 )
  theverge.com               43    16313  ( 370.75 )     14335  ( 325.80 )
  arxiv.org                  39     7415  ( 185.38 )      3559  (  88.97 )
  washingtonpost.com         39    15778  ( 394.45 )     18117  ( 452.93 )
  bbc.com                    37    11600  ( 305.26 )      8696  ( 228.84 )
  newyorker.com              37     7577  ( 199.39 )      6549  ( 172.34 )
  wsj.com                    36    10920  ( 295.14 )     11646  ( 314.76 )
  wired.com                  35     9104  ( 252.89 )      6738  ( 187.17 )
  archive.org                32     8011  ( 242.76 )      4626  ( 140.18 )
  gist.github.com            32    10287  ( 311.73 )      5456  ( 165.33 )
  reddit.com                 30    12579  ( 405.77 )      8457  ( 272.81 )
  theregister.com            29     8288  ( 276.27 )      4586  ( 152.87 )
  apple.com                  28    13245  ( 456.72 )     12917  ( 445.41 )
  github.blog                26     8398  ( 311.04 )      4242  ( 157.11 )
  cnbc.com                   23     8568  ( 357.00 )     10356  ( 431.50 )
  phys.org                   23     4918  ( 204.92 )      2380  (  99.17 )
  theatlantic.com            23     7518  ( 313.25 )     10643  ( 443.46 )
  axios.com                  22     8903  ( 387.09 )      8616  ( 374.61 )
  news.mit.edu               22     6181  ( 268.74 )      2887  ( 125.52 )
  smithsonianmag.com         22     4964  ( 215.83 )      2988  ( 129.91 )
  stanford.edu               22     8461  ( 367.87 )      4720  ( 205.22 )
  krebsonsecurity.com        21     6299  ( 286.32 )      3331  ( 151.41 )
  microsoft.com              21     7809  ( 354.95 )      4392  ( 199.64 )
  atlasobscura.com           20     2789  ( 132.81 )      1637  (  77.95 )
  cnn.com                    19     4704  ( 235.20 )      4252  ( 212.60 )
  righto.com                 19     2568  ( 128.40 )       795  (  39.75 )
  simonwillison.net          17     4878  ( 271.00 )      1553  (  86.28 )

TechCrunch, BTW, lands at #41:

  techcrunch.com             17     8681  ( 482.28 )      8224  ( 456.89 )

(The "mean" values are the arithmetic mean of points (votes) and comments by domain.)

For 2023, there've only been 10 TechCrunch items (through 21-6-2023), well below trend:

  Ubuntu 22.04 LTS servers and phased apt updates
  Twitterrific has been discontinued
  DuckDB – An in-process SQL OLAP database management system
  Shane Pitman, leader of the warez group Razor 1911: life after prison (2005)
  Nearly 40% of software engineers will only work remotely
  Htmx 1.9.0 has been released
  Geometry Central: library of data structures, algorithms for geometry processing
  Google Authenticator now supports Google Account synchronization
  I Wrote an Activitypub Server in OCaml: Lessons Learnt, Weekends Lost
  In New Paradox, Black Holes Appear to Evade Heat Death

I'll note that breaking stories down by *site* will tend to obscure *categories*, as frequently-submitted sites (say, NY Times) will crowd out *many individual blogs*. I could probably do some manual classification based on sites, including, say, all categories of Twitter (currently broken out by user/account), and might look into that.

One of the most surprising facts to jump out to me is how much nytimes.com has fallen since 2019. It had previously been in the top-4 submitted sites pretty consistently, and the single top site for 2014--2019, but fell to 7th in 2020 and 9th in 2021, recovering to 5th in 2022.

I've also paired my own analysis with a 2022 study published by Whaly.io based on the HN API and *all* content submitted: <https://whaly.io/posts/hacker-news-2021-retrospective>

I've been somewhat live-blogging my analysis on the Fediverse under the #HackerNewsAnalytics hashtag:

<https://toot.cat/@dredmorbius/tagged/HackerNewsAnalytics>

That includes a number of findings (and testing/debugging notes), including: mentions of Reddit by year, mentions of the FP-500 companies (top-10: Apple, Microsoft, Amazon, Intel, Tesla, Netflix, IBM, Adobe, Oracle, and AT&T, though Google under various terms (Google, Alphabet, YouTube, Android) nearly doubles the top-ranked Apple, and no, adding in iPhone, iPad, MacBook, etc., doesn't help), trends in votes and comments by story position (interesting IMO), overall submission success rate (a hair under 3%), mentions of the FP Top 100 Global Thinkers in titles (reprising an old study of mine of numerous online sites), a look at the Leaders characteristics, what HN cares about being down, and, well, ... *things*: <https://toot.cat/@dredmorbius/110454128168815763>

________________________________

Notes:

* "Washington" can of course designate both a *city* and a *state*, amongst other things, and it turns out that the string is dominated by references to the *Washington Post*, much as "New York" is by the *New York Times*. But the list gives the naive ranking. Adding in "Silicon Valley" and "San Francisco" put California well on top.

*Edits:* Some in situ updates as I think of things. Sorry!
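
For anyone wanting to reproduce a per-site breakdown like the 2022 table above, here is a minimal Python sketch. It is not the tooling used for the numbers in this comment (that isn't described here). It assumes each archived front-page entry has already been parsed into a record with date, rank, site, points, and comments, and that the mean is simply the total divided by the story count. The `Story` class, `summarize_by_site` function, and the sample rows are all hypothetical names and made-up data for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Story:
    """One archived front-page entry (field names are illustrative)."""
    date: str
    rank: int
    site: str      # submitted domain, or "n/a" when no site is present
    points: int
    comments: int


def summarize_by_site(stories):
    """Return (site, stories, points, mean_points, comments, mean_comments)
    tuples, sorted by story count (descending), then by site name."""
    totals = defaultdict(lambda: [0, 0, 0])  # site -> [stories, points, comments]
    for s in stories:
        t = totals[s.site]
        t[0] += 1
        t[1] += s.points
        t[2] += s.comments
    rows = [
        (site, n, pts, pts / n, com, com / n)
        for site, (n, pts, com) in totals.items()
    ]
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows


# Made-up sample records, not real HN data.
SAMPLE = [
    Story("2022-01-01", 1, "example.com", 100, 40),
    Story("2022-01-01", 2, "example.org", 50, 10),
    Story("2022-01-02", 5, "example.com", 300, 60),
    Story("2022-01-03", 3, "n/a", 80, 120),
]

if __name__ == "__main__":
    for site, n, pts, mean_pts, com, mean_com in summarize_by_site(SAMPLE):
        print(f"{site:<20}  {n:>7}  {pts:>7}  ( {mean_pts:>6.2f} )"
              f"  {com:>8}  ( {mean_com:>6.2f} )")
```

Sorting on story count first mirrors the ordering of the table; ties are broken alphabetically here, which is an arbitrary choice.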