So, oddly enough, I've also been looking at HN front-page characteristics, based on the same corpus (the "past" page links). And that whole section on caveats about what that archive represents is something I could have written... The front page, in both its dynamic and archived forms, is strongly subject to many influences in complex ways.<p>A couple of tips:<p>- It's possible to crawl the page using wget, given a reasonable delay. The full collection from 2007 to present (I'd done my first crawl in late May of this year) took a couple of days. Updates to that happen in seconds.<p>- I break down data by date, story position (e.g., rank 1--30), submitted site (if present), points (votes), comments, and submitter, as well as title.<p>- I'm working on classifying titles. The original question prompting my analysis was what US states get the most love from HN (NY, CA, WA*, TX, and CO are the top 5). I've since expanded that to US and globally-significant cities, and have been doing some tuple-based ngram analysis, though that gets pretty hairy.<p>For 2022 (most recent complete year), the top 40 submitted front-page sites are:<p><pre><code> 2022: Distinct sites: 6446
Site                           Stories Points (  mean  ) Comments (  mean  )
------------------------------ ------- ------ ---------- -------- ----------
n/a                                432 167275 ( 386.32 )   125304 ( 289.39 )
youtube.com                        105  27243 ( 257.01 )    12489 ( 117.82 )
nature.com                          80  17694 ( 218.44 )    11716 ( 144.64 )
wikipedia.org                       68  12258 ( 177.65 )     5855 (  84.86 )
nytimes.com                         67  21190 ( 311.62 )    21765 ( 320.07 )
arstechnica.com                     63  18319 ( 286.23 )    12057 ( 188.39 )
ieee.org                            53   9432 ( 174.67 )     5933 ( 109.87 )
reuters.com                         53  28360 ( 525.19 )    29033 ( 537.65 )
theguardian.com                     49  12228 ( 244.56 )     8677 ( 173.54 )
quantamagazine.org                  48  11293 ( 230.47 )     5519 ( 112.63 )
science.org                         47  12485 ( 260.10 )     7655 ( 159.48 )
economist.com                       46  12504 ( 266.04 )    17324 ( 368.60 )
bloomberg.com                       43  20037 ( 455.39 )    20630 ( 468.86 )
lwn.net                             43  10566 ( 240.14 )     5912 ( 134.36 )
theverge.com                        43  16313 ( 370.75 )    14335 ( 325.80 )
arxiv.org                           39   7415 ( 185.38 )     3559 (  88.97 )
washingtonpost.com                  39  15778 ( 394.45 )    18117 ( 452.93 )
bbc.com                             37  11600 ( 305.26 )     8696 ( 228.84 )
newyorker.com                       37   7577 ( 199.39 )     6549 ( 172.34 )
wsj.com                             36  10920 ( 295.14 )    11646 ( 314.76 )
wired.com                           35   9104 ( 252.89 )     6738 ( 187.17 )
archive.org                         32   8011 ( 242.76 )     4626 ( 140.18 )
gist.github.com                     32  10287 ( 311.73 )     5456 ( 165.33 )
reddit.com                          30  12579 ( 405.77 )     8457 ( 272.81 )
theregister.com                     29   8288 ( 276.27 )     4586 ( 152.87 )
apple.com                           28  13245 ( 456.72 )    12917 ( 445.41 )
github.blog                         26   8398 ( 311.04 )     4242 ( 157.11 )
cnbc.com                            23   8568 ( 357.00 )    10356 ( 431.50 )
phys.org                            23   4918 ( 204.92 )     2380 (  99.17 )
theatlantic.com                     23   7518 ( 313.25 )    10643 ( 443.46 )
axios.com                           22   8903 ( 387.09 )     8616 ( 374.61 )
news.mit.edu                        22   6181 ( 268.74 )     2887 ( 125.52 )
smithsonianmag.com                  22   4964 ( 215.83 )     2988 ( 129.91 )
stanford.edu                        22   8461 ( 367.87 )     4720 ( 205.22 )
krebsonsecurity.com                 21   6299 ( 286.32 )     3331 ( 151.41 )
microsoft.com                       21   7809 ( 354.95 )     4392 ( 199.64 )
atlasobscura.com                    20   2789 ( 132.81 )     1637 (  77.95 )
cnn.com                             19   4704 ( 235.20 )     4252 ( 212.60 )
righto.com                          19   2568 ( 128.40 )      795 (  39.75 )
simonwillison.net                   17   4878 ( 271.00 )     1553 (  86.28 )
</code></pre>
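For anyone wanting to build a similar per-site table, here's a rough sketch of the aggregation in Python. It is not my actual tooling. The record layout, the function name, and the sample values are all made up for illustration, and it assumes "n/a" stands for stories with no submitted site.

```python
from collections import defaultdict

# Hypothetical records: (site, points, comments). Values are invented.
stories = [
    ("example.com", 120, 40),
    ("example.com", 80, 20),
    ("example.org", 300, 150),
    (None, 50, 10),  # assumption: stories without a site are grouped as "n/a"
]

def summarize(stories):
    """Return per-site rows: (site, stories, points, mean, comments, mean)."""
    totals = defaultdict(lambda: [0, 0, 0])  # story count, points, comments
    for site, points, comments in stories:
        row = totals[site or "n/a"]
        row[0] += 1
        row[1] += points
        row[2] += comments
    # Sort by story count (descending), then by site name for stable output.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1][0], kv[0]))
    return [
        (site, n, pts, pts / n, com, com / n)
        for site, (n, pts, com) in ranked
    ]

for site, n, pts, pts_mean, com, com_mean in summarize(stories):
    print(f"{site:<30} {n:>7} {pts:>6} ( {pts_mean:>6.2f} ) {com:>8} ( {com_mean:>6.2f} )")
```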
TechCrunch, BTW, lands at #41:<p><pre><code> techcrunch.com 17 8681 ( 482.28 ) 8224 ( 456.89 )
</code></pre>
(The "mean" values are the arithmetic mean of points (votes) and comments by domain.)<p>For 2023, there've only been 10 TechCrunch items (through 21-6-2023), well below trend:<p><pre><code> Ubuntu 22.04 LTS servers and phased apt updates
Twitterrific has been discontinued
DuckDB – An in-process SQL OLAP database management system
Shane Pitman, leader of the warez group Razor 1911: life after prison (2005)
Nearly 40% of software engineers will only work remotely
Htmx 1.9.0 has been released
Geometry Central: library of data structures, algorithms for geometry processing
Google Authenticator now supports Google Account synchronization
I Wrote an Activitypub Server in OCaml: Lessons Learnt, Weekends Lost
In New Paradox, Black Holes Appear to Evade Heat Death
</code></pre>
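Partial-year counts like that depend on keeping the crawl current, so here's a minimal sketch of the polite day-by-day crawl from the tips at the top. I used wget; this swaps in Python's urllib instead. The URL pattern for the "past" page, the function names, and the 30-second default delay are assumptions for illustration, not a record of what I ran.

```python
import datetime as dt
import time
import urllib.request

# Assumption: each day's "past" page is reachable at this URL pattern.
BASE = "https://news.ycombinator.com/front?day={day}"

def day_urls(start, end):
    """Yield (day, url) for every date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day, BASE.format(day=day.isoformat())
        day += dt.timedelta(days=1)

def crawl(start, end, delay=30.0, fetch=None):
    """Fetch each day's page, sleeping `delay` seconds between requests."""
    # `fetch` can be swapped out (e.g. for testing without network access).
    fetch = fetch or (lambda url: urllib.request.urlopen(url).read())
    pages = {}
    for i, (day, url) in enumerate(day_urls(start, end)):
        if i:
            time.sleep(delay)  # be gentle with the server
        pages[day] = fetch(url)
    return pages
```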
I'll note that breaking stories down by <i>site</i> will tend to obscure <i>categories</i>, as frequently-submitted sites (say, NY Times) will crowd out <i>many individual blogs</i>. I could probably do some manual classification based on sites, including, say, all categories of Twitter (currently broken out by user/account), and might look into that.<p>One of the most surprising facts to jump out at me is how much nytimes.com has fallen since 2019. It had previously been in the top-4 submitted sites pretty consistently, and single top for 2014--2019, but fell to 7th in 2020 and 9th in 2021, recovering to 5th in 2022.<p>I've also paired my own analysis with a 2022 study published by Whaly.io based on the HN API and <i>all</i> content submitted: <<a href="https://whaly.io/posts/hacker-news-2021-retrospective">https://whaly.io/posts/hacker-news-2021-retrospective</a>><p>I've been somewhat live-blogging my analysis on the Fediverse under the #HackerNewsAnalytics hashtag:<p><<a href="https://toot.cat/@dredmorbius/tagged/HackerNewsAnalytics" rel="nofollow noreferrer">https://toot.cat/@dredmorbius/tagged/HackerNewsAnalytics</a>><p>That includes a number of findings (and testing/debugging notes), including: mentions of Reddit by year, mentions of the FP-500 companies (top-10: Apple, Microsoft, Amazon, Intel, Tesla, Netflix, IBM, Adobe, Oracle, and AT&T, though Google under various terms (Google, Alphabet, YouTube, Android) nearly doubles the top-ranked Apple, and no, adding in iPhone, iPad, MacBook, etc., doesn't help), trends in votes and comments by story position (interesting IMO), overall submission success rate (a hair under 3%), mentions of the FP Top 100 Global Thinkers in titles (reprising an old study of mine of numerous online sites), a look at the Leaders' characteristics, what HN cares about being down, and, well, ... 
<i>things</i>: <<a href="https://toot.cat/@dredmorbius/110454128168815763" rel="nofollow noreferrer">https://toot.cat/@dredmorbius/110454128168815763</a>><p>________________________________<p>Notes:<p>* "Washington" can of course designate both a <i>city</i> and a <i>state</i>, amongst other things, and it turns out that the string is dominated by references to the <i>Washington Post</i>, much as "New York" is by the <i>New York Times</i>. But the list gives the naive ranking. Adding in "Silicon Valley" and "San Francisco" put California well on top.<p><i>Edits:</i> Some in situ updates as I think of things. Sorry!
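<p>On the naive ranking in the note above: a minimal Python sketch of plain string matching on titles, with an optional pass that blanks out known false-positive phrases first. The state list is a small subset, the titles are invented, and the function name and `exclude` parameter are hypothetical. My own matching may differ in detail.

```python
import re
from collections import Counter

# Illustrative subset; a full run would list every state.
STATES = ["New York", "California", "Washington", "Texas", "Colorado"]

# Invented titles, for demonstration only.
titles = [
    "Washington Post lays off staff",
    "Texas power grid under strain",
    "New York Times sues over scraping",
    "Washington state passes privacy law",
]

def count_mentions(titles, names, exclude=()):
    """Count titles mentioning each name, after removing excluded phrases."""
    counts = Counter()
    for title in titles:
        # Blank out known false-positive phrases before matching.
        for phrase in exclude:
            title = title.replace(phrase, "")
        for name in names:
            if re.search(rf"\b{re.escape(name)}\b", title):
                counts[name] += 1
    return counts

naive = count_mentions(titles, STATES)
filtered = count_mentions(
    titles, STATES, exclude=["Washington Post", "New York Times"]
)
```

With these sample titles, the naive pass counts "Washington" twice, and the filtered pass counts it once and drops "New York" entirely.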