There are also two DBs I know of that keep an updated Hacker News table you can run analytics on without needing to download anything first:

- BigQuery (requires a Google Cloud account; querying should fall within the free tier, I'd guess): `bigquery-public-data.hacker_news.full`

- ClickHouse: no signup needed, and you can run queries directly in the browser [1]

[1] https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPTSBoYWNrZXJuZXdzX2hpc3RvcnkgV0hFUkUgbG93ZXIodGV4dCkgTElLRSAnJXB5dGhvbiUnIE9SREVSIEJZIHRpbWUgREVTQyBMSU1JVCAxMA==
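If you'd rather script it than use the browser UI, the playground also exposes ClickHouse's plain HTTP interface, if I'm remembering it right (the endpoint and anonymous `play` user are assumptions on my part; the query is just the one decoded from the link above):

```python
import requests

# The query from the playground link above: last 10 items mentioning "python".
QUERY = """
SELECT *
FROM hackernews_history
WHERE lower(text) LIKE '%python%'
ORDER BY time DESC
LIMIT 10
FORMAT JSONEachRow
"""

# Assumption: the public playground serves read-only queries over the regular
# ClickHouse HTTP interface with the anonymous "play" user.
resp = requests.post(
    "https://play.clickhouse.com/",
    params={"user": "play"},
    data=QUERY,
    timeout=30,
)
resp.raise_for_status()
for line in resp.text.splitlines():
    print(line)
```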
I did something similar a while back to the @fesshole Twitter/Bluesky account. Downloaded the entire archive and fine-tuned a model on it to create more unhinged confessions.

Was feeling pretty pleased with myself until I realised that all I’d done was teach an innocent machine about wanking and divorce. Felt like that bit in a sci-fi movie where the alien/super-intelligent AI speed-watches humanity’s history and decides we’re not worth saving after all.
*I had a 20 GiB JSON file of everything that has ever happened on Hacker News*

I'm actually surprised at that volume, given this is a text-only site. Humans have managed to post *over 20 billion bytes* of text to it over the 18 years that HN has existed? That averages out to a bit over 3 MB per day, or roughly 38 bytes per second.
What is the netiquette of downloading HN? Do you ping Dang and ask him before you blow up his servers? Or do you just assume at this point that every billion dollar tech company is doing this many times over so you probably won't even be noticed?
I have done something similar. I cheated and used the BigQuery dataset (which somehow keeps getting updated), exported the data to Parquet, downloaded it, and queried it with DuckDB.
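The local half of that is pleasantly short. Roughly this, assuming the export landed as a directory of `hn_*.parquet` files that kept the BigQuery column names:

```python
import duckdb

# DuckDB queries the Parquet export in place; no import step needed.
con = duckdb.connect()  # in-memory database
top_domains = con.sql("""
    SELECT regexp_extract(url, 'https?://([^/]+)', 1) AS domain,
           count(*) AS stories
    FROM 'hn_*.parquet'
    WHERE type = 'story' AND url IS NOT NULL
    GROUP BY domain
    ORDER BY stories DESC
    LIMIT 20
""").df()
print(top_domains)
```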
> *Now that I have a local download of all Hacker News content, I can train hundreds of LLM-based bots on it and run them as contributors, slowly and inevitably replacing all human text with the output of a chinese room oscillator perpetually echoing and recycling the past.*

The author said this in jest, but I fear someone, someday, will try this; I hope it never happens but if it does, could we stop it?
I predict that in the coming years a lot of APIs will begin to offer the option of just returning a DuckDB file. If you're just going to load the JSON into a database anyway, why not just get a database in the response?
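Consuming it on the client side would be trivial. A sketch of what I mean, with a made-up endpoint and table name:

```python
import tempfile

import duckdb
import requests

# Hypothetical endpoint that returns a ready-made DuckDB database file.
URL = "https://api.example.com/hn/export.duckdb"

resp = requests.get(URL, timeout=60)
resp.raise_for_status()

# DuckDB opens databases from a path, so stash the bytes in a temp file first.
with tempfile.NamedTemporaryFile(suffix=".duckdb", delete=False) as f:
    f.write(resp.content)
    path = f.name

con = duckdb.connect(path, read_only=True)
print(con.sql("SELECT count(*) FROM items").fetchone())  # "items" is made up too
```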
Please do not use stacked charts! I think it's close to impossible not to distort the reader's impression, because a) it's very hard to gauge the height of a given data point amid the noise, and b) they imply a dependency where there _probably_ is none.
I wrote one a while back (https://github.com/ashish01/hn-data-dumps) and it was a lot of fun. One thing that would be cool to implement: recent items keep changing for a while after they are posted, so freshly downloaded recent items go stale much faster than older ones and deserve to be re-fetched more often.
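A simple way to handle that (just a sketch of the idea; the thresholds are arbitrary) is to derive the re-fetch interval from the item's age:

```python
import time

def refresh_interval_seconds(item_time: int, now: float | None = None) -> float:
    """How long a downloaded item can be considered fresh, based on its age.

    Arbitrary schedule: items from the last day get re-fetched hourly, items
    from the last month daily, everything older roughly once a year.
    """
    now = time.time() if now is None else now
    age = now - item_time  # the HN item `time` field is a Unix timestamp
    if age < 24 * 3600:
        return 3600.0
    if age < 30 * 24 * 3600:
        return 24 * 3600.0
    return 365 * 24 * 3600.0

def is_stale(item: dict, fetched_at: float) -> bool:
    """True if the local copy of `item` should be re-downloaded."""
    return time.time() - fetched_at > refresh_interval_seconds(item["time"])
```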
It would be great if it were available as a torrent. There are also mutable torrents [1]. They're not implemented everywhere, but implementations do exist [2].

[1] https://www.bittorrent.org/beps/bep_0046.html

[2] https://www.npmjs.com/package/bittorrent-dht
Hah, I've been scraping HN over the past couple of weeks to do something similar! Only submissions though, not comments. It was after I went to /newest and was faced with roughly 9/10 posts being AI-related. I was curious what the actual percentage of AI-related posts on HN was, and how it compared to other things heavily hyped in the past, like Web3 and crypto.
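For a first pass, nothing fancier than keyword matching on titles seems necessary; roughly something like this (the keyword lists are obviously debatable):

```python
import re
from collections import Counter

# Rough, debatable keyword buckets for comparing hype cycles.
TOPICS = {
    "ai": re.compile(r"\b(ai|llm|gpt|chatgpt|openai)\b", re.I),
    "crypto/web3": re.compile(r"\b(crypto|bitcoin|ethereum|web3|nft|blockchain)\b", re.I),
}

def topic_share(titles: list[str]) -> dict[str, float]:
    """Fraction of titles that match each topic bucket."""
    counts = Counter()
    for title in titles:
        for topic, pattern in TOPICS.items():
            if pattern.search(title):
                counts[topic] += 1
    total = len(titles) or 1
    return {topic: counts[topic] / total for topic in TOPICS}

# Example with a few made-up titles from /newest:
print(topic_share([
    "Show HN: An LLM that reviews your PRs",
    "Bitcoin hits a new all-time high",
    "A deep dive into B-tree internals",
]))
```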
> *The Rise Of Rust*

Shouldn't that be The Fall Of Rust? According to this, it saw the most attention during the years before it was created!
Can you scrape all of HN by just incrementing item?id (since it's sequential) and using Python web requests with IP rotation (in case there is rate limiting)?

NVM, going item by item would take about 460 days if the average request takes 1 second (unless heavily parallelized; 500 instances _could_ do it in a day, but that's 40 million requests either way, so it would raise alarms).
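For what it's worth, the official Firebase API serves items by ID and publishes the current max ID, so modest parallelism shouldn't need IP rotation. A minimal concurrent sketch (aiohttp; the concurrency of 100 is an arbitrary, polite choice):

```python
import asyncio
import json

import aiohttp

API = "https://hacker-news.firebaseio.com/v0"
CONCURRENCY = 100  # arbitrary; be polite

async def fetch_item(session: aiohttp.ClientSession, sem: asyncio.Semaphore, item_id: int):
    async with sem:
        async with session.get(f"{API}/item/{item_id}.json") as resp:
            return await resp.json()

async def main(start_id: int, count: int) -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        async with session.get(f"{API}/maxitem.json") as resp:
            max_id = await resp.json()
        end_id = min(start_id + count, max_id + 1)
        tasks = [fetch_item(session, sem, i) for i in range(start_id, end_id)]
        with open("items.jsonl", "a") as out:
            for item in await asyncio.gather(*tasks):
                if item is not None:  # deleted/missing IDs come back as null
                    out.write(json.dumps(item) + "\n")

asyncio.run(main(start_id=1, count=10_000))
```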
One thing I'm curious about, but I guess not visible in any way, is random stats about my own user/usage of the site. What's my upvote/downvote ratio? Are there users I constantly upvote/downvote? Who is liking/hating my comments the most? And some that I guess could be scraped: which days/times am I the most active (like the GitHub green grid thingy)? How has my activity changed over the years?
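Votes aren't exposed anywhere as far as I know, but the activity grid is doable from the public Algolia search API. A rough sketch (swap in your username; accounts with more than 1000 comments would need pagination):

```python
from collections import Counter
from datetime import datetime, timezone

import requests

USERNAME = "your_username_here"  # placeholder

# Algolia's HN search API returns comment metadata, including Unix timestamps.
resp = requests.get(
    "https://hn.algolia.com/api/v1/search_by_date",
    params={"tags": f"comment,author_{USERNAME}", "hitsPerPage": 1000},
    timeout=30,
)
resp.raise_for_status()

by_weekday_hour = Counter()
for hit in resp.json()["hits"]:
    ts = datetime.fromtimestamp(hit["created_at_i"], tz=timezone.utc)
    by_weekday_hour[(ts.strftime("%a"), ts.hour)] += 1

for (day, hour), n in by_weekday_hour.most_common(10):
    print(f"{day} {hour:02d}:00 UTC  {n} comments")
```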
I have this data and a bunch of interesting analysis to share. Any suggestions on the best way to share the results?

I like Tableau Public, because it allows for interactivity and exploration, but it can't handle this many rows of data.

Is there a good tool for making charts directly from ClickHouse data?
Other people have asked, probably for the same reason, but I would love an offline version, packaged in ZIM format or something.

For when the apocalypse happens, it’ll be enjoyable to read relatively high quality interactions, and some of them may include useful post-apoc tidbits!
Can you remake the stacked graphs with the variable of interest at the bottom? It's hard to see the percentage of Rust when it's all the way at the top, with a lot of noise in the lower layers.

Edit: or make a non-stacked version?
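Even plain per-month share lines would do it. A sketch of what I mean (the CSV and column names are made up):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical frame: one row per (month, language) with that language's
# share of comments for the month.
df = pd.read_csv("language_share_by_month.csv", parse_dates=["month"])

fig, ax = plt.subplots(figsize=(10, 5))
for language, group in df.groupby("language"):
    ax.plot(group["month"], group["share"], label=language)

ax.set_ylabel("Share of comments")
ax.legend()
fig.tight_layout()
fig.savefig("language_share.png")
```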
I've been tempted to look into API-based HN access, having scraped the front-page archive about two years ago.

One of the advantages of comments is that there's simply *so much more text* to work with. For the front page, there is *up to* 80 characters of context (often deliberately obtuse), as well as metadata (date, story position, votes, site, submitter).

I'd initially embarked on the project to find out which cities were mentioned most often on HN (in front-page titles), though it turned out to be a much more interesting project than I'd anticipated.

(I've somewhat neglected it for a while, though I'll occasionally spin it up to check on questions or ideas.)