There are also two DBs I know of that keep an updated Hacker News table you can run analytics on without needing to download anything first:

- BigQuery (requires a Google Cloud account; querying should fall within the free tier, I'd guess): `bigquery-public-data.hacker_news.full`

- ClickHouse: no signup needed, and you can run queries directly in the browser [1]

[1] https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPTSBoYWNrZXJuZXdzX2hpc3RvcnkgV0hFUkUgbG93ZXIodGV4dCkgTElLRSAnJXB5dGhvbiUnIE9SREVSIEJZIHRpbWUgREVTQyBMSU1JVCAxMA==
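If you'd rather script it than use the browser UI, the playground also exposes ClickHouse's plain HTTP interface, if I'm remembering it right (the endpoint and anonymous `play` user are assumptions on my part; the query is just the one decoded from the link above):

```python
import requests

# The query from the playground link above: last 10 items mentioning "python".
QUERY = """
SELECT *
FROM hackernews_history
WHERE lower(text) LIKE '%python%'
ORDER BY time DESC
LIMIT 10
FORMAT JSONEachRow
"""

# Assumption: the public playground serves read-only queries over the regular
# ClickHouse HTTP interface with the anonymous "play" user.
resp = requests.post(
    "https://play.clickhouse.com/",
    params={"user": "play"},
    data=QUERY,
    timeout=30,
)
resp.raise_for_status()
for line in resp.text.splitlines():
    print(line)
```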
I did something similar a while back to the @fesshole Twitter/Bluesky account. Downloaded the entire archive and fine-tuned a model on it to create more unhinged confessions.

Was feeling pretty pleased with myself until I realised that all I’d done was teach an innocent machine about wanking and divorce. Felt like that bit in a sci-fi movie where the alien/super-intelligent AI speed-watches humanity’s history and decides we’re not worth saving after all.
*I had a 20 GiB JSON file of everything that has ever happened on Hacker News*

I'm actually surprised at that volume, given this is a text-only site. Humans have managed to post *over 20 billion bytes* of text to it over the 18 years that HN has existed? That averages out to a bit over 3 MB per day, or roughly 38 bytes per second.
What is the netiquette of downloading HN? Do you ping Dang and ask him before you blow up his servers? Or do you just assume at this point that every billion dollar tech company is doing this many times over so you probably won't even be noticed?
I have done something similar. I cheated and used the BigQuery dataset (which somehow keeps getting updated), exported the data to Parquet, downloaded it, and queried it with DuckDB.
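The local half of that is pleasantly short. Roughly this, assuming the export landed as a directory of `hn_*.parquet` files that kept the BigQuery column names:

```python
import duckdb

# DuckDB queries the Parquet export in place; no import step needed.
con = duckdb.connect()  # in-memory database
top_domains = con.sql("""
    SELECT regexp_extract(url, 'https?://([^/]+)', 1) AS domain,
           count(*) AS stories
    FROM 'hn_*.parquet'
    WHERE type = 'story' AND url IS NOT NULL
    GROUP BY domain
    ORDER BY stories DESC
    LIMIT 20
""").df()
print(top_domains)
```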
> *Now that I have a local download of all Hacker News content, I can train hundreds of LLM-based bots on it and run them as contributors, slowly and inevitably replacing all human text with the output of a chinese room oscillator perpetually echoing and recycling the past.*

The author said this in jest, but I fear someone, someday, will try this; I hope it never happens but if it does, could we stop it?
I predict that in the coming years a lot of APIs will begin to offer the option of just returning a DuckDB file. If you're just going to load the JSON into a database anyway, why not just get a database in the response?
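Consuming it on the client side would be trivial. A sketch of what I mean, with a made-up endpoint and table name:

```python
import tempfile

import duckdb
import requests

# Hypothetical endpoint that returns a ready-made DuckDB database file.
URL = "https://api.example.com/hn/export.duckdb"

resp = requests.get(URL, timeout=60)
resp.raise_for_status()

# DuckDB opens databases from a path, so stash the bytes in a temp file first.
with tempfile.NamedTemporaryFile(suffix=".duckdb", delete=False) as f:
    f.write(resp.content)
    path = f.name

con = duckdb.connect(path, read_only=True)
print(con.sql("SELECT count(*) FROM items").fetchone())  # "items" is made up too
```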
Please do not use stacked charts! I think it's close to impossible not to distort the reader's impression, because a) it's very hard to gauge the height of a given data point amid the noise, and b) they imply a dependency where there _probably_ is none.
I wrote one a while back (https://github.com/ashish01/hn-data-dumps) and it was a lot of fun. One thing that would be cool to implement: recent items keep changing for a while after they are posted, so freshly downloaded recent items go stale much faster than older ones and deserve to be re-fetched more often.
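A simple way to handle that (just a sketch of the idea; the thresholds are arbitrary) is to derive the re-fetch interval from the item's age:

```python
import time

def refresh_interval_seconds(item_time: int, now: float | None = None) -> float:
    """How long a downloaded item can be considered fresh, based on its age.

    Arbitrary schedule: items from the last day get re-fetched hourly, items
    from the last month daily, everything older roughly once a year.
    """
    now = time.time() if now is None else now
    age = now - item_time  # the HN item `time` field is a Unix timestamp
    if age < 24 * 3600:
        return 3600.0
    if age < 30 * 24 * 3600:
        return 24 * 3600.0
    return 365 * 24 * 3600.0

def is_stale(item: dict, fetched_at: float) -> bool:
    """True if the local copy of `item` should be re-downloaded."""
    return time.time() - fetched_at > refresh_interval_seconds(item["time"])
```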
It would be great if it were available as a torrent. There are also mutable torrents [1]. They're not implemented everywhere, but implementations do exist [2].

[1] https://www.bittorrent.org/beps/bep_0046.html

[2] https://www.npmjs.com/package/bittorrent-dht
Hah, I've been scraping HN over the past couple of weeks to do something similar! Only submissions though, not comments. It was after I went to /newest and was faced with roughly 9/10 posts being AI-related. I was curious what the actual percentage of AI-related posts on HN was, and how it compared to other things heavily hyped in the past, like Web3 and crypto.
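For a first pass, nothing fancier than keyword matching on titles seems necessary; roughly something like this (the keyword lists are obviously debatable):

```python
import re
from collections import Counter

# Rough, debatable keyword buckets for comparing hype cycles.
TOPICS = {
    "ai": re.compile(r"\b(ai|llm|gpt|chatgpt|openai)\b", re.I),
    "crypto/web3": re.compile(r"\b(crypto|bitcoin|ethereum|web3|nft|blockchain)\b", re.I),
}

def topic_share(titles: list[str]) -> dict[str, float]:
    """Fraction of titles that match each topic bucket."""
    counts = Counter()
    for title in titles:
        for topic, pattern in TOPICS.items():
            if pattern.search(title):
                counts[topic] += 1
    total = len(titles) or 1
    return {topic: counts[topic] / total for topic in TOPICS}

# Example with a few made-up titles from /newest:
print(topic_share([
    "Show HN: An LLM that reviews your PRs",
    "Bitcoin hits a new all-time high",
    "A deep dive into B-tree internals",
]))
```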
> *The Rise Of Rust*

Shouldn't that be The Fall Of Rust? According to this, it saw the most attention during the years before it was created!
Can you scrape all of HN by just incrementing item?id (since it's sequential) and using Python web requests with IP rotation (in case there is rate limiting)?

NVM, going item by item would take about 460 days if the average request takes 1 second (unless heavily parallelized; 500 instances _could_ do it in a day, but that's 40 million requests either way, so it would raise alarms).
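For what it's worth, the official Firebase API serves items by ID and publishes the current max ID, so modest parallelism shouldn't need IP rotation. A minimal concurrent sketch (aiohttp; the concurrency of 100 is an arbitrary, polite choice):

```python
import asyncio
import json

import aiohttp

API = "https://hacker-news.firebaseio.com/v0"
CONCURRENCY = 100  # arbitrary; be polite

async def fetch_item(session: aiohttp.ClientSession, sem: asyncio.Semaphore, item_id: int):
    async with sem:
        async with session.get(f"{API}/item/{item_id}.json") as resp:
            return await resp.json()

async def main(start_id: int, count: int) -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        async with session.get(f"{API}/maxitem.json") as resp:
            max_id = await resp.json()
        end_id = min(start_id + count, max_id + 1)
        tasks = [fetch_item(session, sem, i) for i in range(start_id, end_id)]
        with open("items.jsonl", "a") as out:
            for item in await asyncio.gather(*tasks):
                if item is not None:  # deleted/missing IDs come back as null
                    out.write(json.dumps(item) + "\n")

asyncio.run(main(start_id=1, count=10_000))
```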
One thing I'm curious about, but I guess not visible in any way, is random stats about my own user/usage of the site. What's my upvote/downvote ratio? Are there users I constantly upvote/downvote? Who is liking/hating my comments the most? And some that I guess could be scraped: which days/times am I the most active (like the GitHub green grid thingy)? How has my activity changed over the years?
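Votes aren't exposed anywhere as far as I know, but the activity grid is doable from the public Algolia search API. A rough sketch (swap in your username; accounts with more than 1000 comments would need pagination):

```python
from collections import Counter
from datetime import datetime, timezone

import requests

USERNAME = "your_username_here"  # placeholder

# Algolia's HN search API returns comment metadata, including Unix timestamps.
resp = requests.get(
    "https://hn.algolia.com/api/v1/search_by_date",
    params={"tags": f"comment,author_{USERNAME}", "hitsPerPage": 1000},
    timeout=30,
)
resp.raise_for_status()

by_weekday_hour = Counter()
for hit in resp.json()["hits"]:
    ts = datetime.fromtimestamp(hit["created_at_i"], tz=timezone.utc)
    by_weekday_hour[(ts.strftime("%a"), ts.hour)] += 1

for (day, hour), n in by_weekday_hour.most_common(10):
    print(f"{day} {hour:02d}:00 UTC  {n} comments")
```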
I have this data and a bunch of interesting analysis to share. Any suggestions on the best way to share the results?

I like Tableau Public, because it allows for interactivity and exploration, but it can't handle this many rows of data.

Is there a good tool for making charts directly from ClickHouse data?
Other people have asked, probably for the same reason, but I would love an offline version, packaged in ZIM format or something.

For when the apocalypse happens, it’ll be enjoyable to read relatively high quality interactions, and some of them may include useful post-apoc tidbits!
Can you remake the stacked graphs with the variable of interest at the bottom? It's hard to see the percentage of Rust when it's all the way at the top, with a lot of noise in the lower layers.

Edit: or make a non-stacked version?
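Even plain per-month share lines would do it. A sketch of what I mean (the CSV and column names are made up):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical frame: one row per (month, language) with that language's
# share of comments for the month.
df = pd.read_csv("language_share_by_month.csv", parse_dates=["month"])

fig, ax = plt.subplots(figsize=(10, 5))
for language, group in df.groupby("language"):
    ax.plot(group["month"], group["share"], label=language)

ax.set_ylabel("Share of comments")
ax.legend()
fig.tight_layout()
fig.savefig("language_share.png")
```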
I've been tempted to look into API-based HN access, having scraped the front-page archive about two years ago.

One of the advantages of comments is that there's simply *so much more text* to work with. For the front page, there is *up to* 80 characters of context (often deliberately obtuse), as well as metadata (date, story position, votes, site, submitter).

I'd initially embarked on the project to find out which cities were mentioned most often on HN (in front-page titles), though it turned out to be a much more interesting project than I'd anticipated.

(I've somewhat neglected it for a while, though I'll occasionally spin it up to check on questions or ideas.)