You Wouldn't Download a Hacker News

458 points by jasonthorsness, 12 days ago

28 comments

montebicyclelo, 11 days ago
There are also two DBs I know of that have an updated Hacker News table for running analytics on without needing to download it first.

- BigQuery (requires a Google Cloud account; querying will be free tier, I'd guess): `bigquery-public-data.hacker_news.full`
- ClickHouse, no signup needed, can run queries in the browser directly [1]

[1] https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPTSBoYWNrZXJuZXdzX2hpc3RvcnkgV0hFUkUgbG93ZXIodGV4dCkgTElLRSAnJXB5dGhvbiUnIE9SREVSIEJZIHRpbWUgREVTQyBMSU1JVCAxMA==
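The ClickHouse playground link carries its query in the URL fragment. A small sketch, assuming the fragment is standard Base64-encoded SQL text (that is how it decodes here; the playground's URL scheme is not described in this thread):

```python
import base64

# URL fragment (the part after '#') copied from the playground link above.
fragment = (
    "U0VMRUNUICogRlJPTSBoYWNrZXJuZXdzX2hpc3RvcnkgV0hFUkUgbG93ZXIodGV4dCkg"
    "TElLRSAnJXB5dGhvbiUnIE9SREVSIEJZIHRpbWUgREVTQyBMSU1JVCAxMA=="
)

# Decode the fragment back into the SQL text the playground would run.
query = base64.b64decode(fragment).decode("utf-8")
print(query)
```

The decoded statement selects the ten most recent rows from `hackernews_history` whose text mentions "python".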
mattkevan, 11 days ago
I did something similar a while back with the @fesshole Twitter/Bluesky account. Downloaded the entire archive and fine-tuned a model on it to create more unhinged confessions.

Was feeling pretty pleased with myself until I realised that all I'd done was teach an innocent machine about wanking and divorce. Felt like that bit in a sci-fi movie where the alien/super-intelligent AI speed-watches humanity's history and decides we're not worth saving after all.
jakegmaths, 12 days ago
Your query for Java will include all instances of JavaScript as well, so you're overrepresenting Java.
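A minimal sketch of the difference, using made-up sample comments rather than the real dataset. The original query is not shown in this thread, so the substring test below is only an assumption about how it matched; the word-boundary regex is one common way to avoid the overlap:

```python
import re

# Hypothetical sample comments; not taken from the actual dataset.
comments = [
    "We rewrote the service in Java last year.",
    "JavaScript frameworks change every week.",
    "java and javascript are unrelated languages",
]

# Naive substring test: every "javascript" also counts as "java".
naive = [c for c in comments if "java" in c.lower()]

# Word-boundary match: "java" must stand alone as a word.
java_word = re.compile(r"\bjava\b", re.IGNORECASE)
strict = [c for c in comments if java_word.search(c)]

print(len(naive), len(strict))
```

The substring test counts all three comments, while the word-boundary pattern skips the one that only mentions JavaScript.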
userbinator, 11 days ago
*I had a 20 GiB JSON file of everything that has ever happened on Hacker News*

I'm actually surprised at that volume, given this is a text-only site. Humans have managed to post *over 20 billion bytes* of text to it over the 18 years that HN has existed? That averages to over 3 MB per day, or around 38 bytes/s.
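The averages can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the 20 GiB and 18 years are the figures from the comment, and because the JSON file also holds keys and metadata, it overstates the amount of human-written text:

```python
# Back-of-the-envelope rates for a 20 GiB archive spread over 18 years.
total_bytes = 20 * 2**30          # 20 GiB expressed in bytes
days = 18 * 365.25                # approximate, allowing for leap years
bytes_per_day = total_bytes / days
bytes_per_second = bytes_per_day / 86_400  # 86,400 seconds per day

print(f"{bytes_per_day / 1e6:.2f} MB/day")
print(f"{bytes_per_second:.1f} bytes/s")
```

Under these assumptions the archive grows by a little over 3 MB per day, which is a few tens of bytes per second.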
SilverBirch, 11 days ago
What is the netiquette of downloading HN? Do you ping Dang and ask him before you blow up his servers? Or do you just assume at this point that every billion-dollar tech company is doing this many times over, so you probably won't even be noticed?
flakiness, 11 days ago
I have done something similar. I cheated by using the BigQuery dataset (which somehow keeps getting updated): I exported the data to Parquet, downloaded it, and queried it using DuckDB.
bambax, 11 days ago
> *Now that I have a local download of all Hacker News content, I can train hundreds of LLM-based bots on it and run them as contributors, slowly and inevitably replacing all human text with the output of a Chinese room oscillator perpetually echoing and recycling the past.*

The author said this in jest, but I fear someone, someday, will try this; I hope it never happens, but if it does, could we stop it?
g8oz, 11 days ago
I predict that in the coming years a lot of APIs will begin to offer the option of just returning a DuckDB file. If you're just going to load the JSON into a database anyway, why not just get a database in the response?
stefs, 11 days ago
please do not use stacked charts! i think it's close to impossible not to distort the reader's impression because a) it's very hard to gauge the height of a certain data point in the noise and b) they're implying a dependency where there _probably_ is none.
ashish01, 12 days ago
I wrote one a while back, https://github.com/ashish01/hn-data-dumps, and it was a lot of fun. One thing that would be cool to implement is handling the fact that more recent items keep changing over time, which makes recently downloaded items go stale faster than older ones.
wslh, 11 days ago
It would be great if it were available as a torrent. There are also mutable torrents [1]. They are not implemented everywhere, but there are implementations available [2].

[1] https://www.bittorrent.org/beps/bep_0046.html

[2] https://www.npmjs.com/package/bittorrent-dht
Am4TIfIsER0ppos, 11 days ago
I hope they snatched my flagged comments. I would be pleased to have helped make the AI into an asshole. Here's hoping for another Tay AI.
shayway, 11 days ago
Hah, I've been scraping HN over the past couple of weeks to do something similar! Only submissions though, not comments. It was after I went to /newest and was faced with roughly 9/10 posts being AI-related. I was curious what percentage of posts on HN were actually about AI, and also how that compared to other things heavily hyped in the past, like Web3 and crypto.
9rx, 11 days ago
> *The Rise Of Rust*

Shouldn't that be The Fall Of Rust? According to this, it saw the most attention during the years before it was created!
hsbauauvhabzb, 11 days ago
Is the raw dataset available anywhere? I really don't like the HN search function, and grepping through the data would be handy.
byearthithatius, 11 days ago
Can you scrape all of HN by just incrementing item?id (since it's sequential) and using Python web requests with IP rotation (in case there is rate limiting)?

NVM, this approach of going item by item would take 460 days if the average request response time is 1 second (unless heavily parallelized; for instance, 500 instances _could_ do it in a day, but that's 40 million requests either way, so it would raise alarms).
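The estimate can be reproduced without touching the network. A rough sketch, assuming the public Hacker News Firebase API item endpoint rather than the `item?id` web page; the item count and latency are the round numbers from the comment, and `item_urls` and `scrape_days` are hypothetical helper names:

```python
# Rough scrape-time estimate for walking item IDs one by one.
# ITEM_URL is the public Hacker News Firebase API endpoint; the item count
# and latency below are round numbers from the comment, not measurements.
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def item_urls(first_id: int, last_id: int):
    """Yield one API URL per sequential item ID (inclusive range)."""
    for item_id in range(first_id, last_id + 1):
        yield ITEM_URL.format(id=item_id)


def scrape_days(total_items: int, seconds_per_request: float, workers: int = 1) -> float:
    """Wall-clock days needed if every worker issues requests back to back."""
    return total_items * seconds_per_request / workers / 86_400


total_items = 40_000_000
print(list(item_urls(1, 3)))
print(round(scrape_days(total_items, 1.0)))          # single worker
print(round(scrape_days(total_items, 1.0, 500), 2))  # 500 parallel workers
```

With one worker the walk takes well over a year; with 500 workers it fits in under a day, but the total number of requests is unchanged.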
deadbabe, 11 days ago
Is the 20GB JSON file available?
matsemann, 11 days ago
One thing I'm curious about, but I guess not visible in any way, is random stats about my own user/usage of the site. What's my upvote/downvote ratio? Are there users I constantly upvote/downvote? Who is liking/hating my comments the most? And some I'd guess could be scrapable: Which days/times am I the most active (like the GitHub green grid thingy)? How has my activity changed over the years?
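The activity-grid idea can be sketched from item timestamps alone. This assumes the Unix `time` field that the HN API returns on each item; the timestamps below are hypothetical placeholders, and `activity_grid` is an illustrative helper, not an existing tool:

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical Unix timestamps standing in for the `time` field of a user's
# items; real values would come from that user's comments and submissions.
timestamps = [1_700_000_000, 1_700_003_600, 1_700_090_000, 1_700_600_000]


def activity_grid(unix_times):
    """Count items per (weekday, hour) bucket in UTC; Monday is weekday 0."""
    grid = Counter()
    for ts in unix_times:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        grid[(dt.weekday(), dt.hour)] += 1
    return grid


grid = activity_grid(timestamps)
for (weekday, hour), count in sorted(grid.items()):
    print(weekday, hour, count)
```

Bucketing by UTC is a simplification; a per-user view would likely want the user's local time zone, which is not part of the data.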
mike503, 11 days ago
Other people have asked, probably for the same reason, but I would love an offline version, packaged in ZIM format or something.

For when the apocalypse happens, it'll be enjoyable to read relatively high-quality interactions, and some of them may include useful post-apoc tidbits!
xnx, 11 days ago
I have this data and a bunch of interesting analysis to share. Any suggestions on the best method to share results?

I like Tableau Public, because it allows for interactivity and exploration, but it can't handle this many rows of data.

Is there a good tool for making charts directly from ClickHouse data?
tacker2000, 11 days ago
Yeah, I also get the feeling that these Rust evangelists get more annoying every day ;p
sebastianmestre, 11 days ago
Can you remake the stacked graphs with the variable of interest at the bottom? It's hard to see the percentage of Rust when it's all the way at the top with a lot of noise on the lower layers.

Edit: or make a non-stacked version?
dredmorbius, 10 days ago
I've been tempted to look into API-based HN access, having scraped the front-page archive about two years ago.

One of the advantages of comments is that there's simply *so much more text* to work with. For the front page, there are *up to* 80 characters of context (often deliberately obtuse), as well as metadata (date, story position, votes, site, submitter).

I'd initially embarked on the project to find out what cities were mentioned most often on HN (in front-page titles), though it turned out to be a much more interesting project than I'd anticipated.

(I've somewhat neglected it for a while, though I'll occasionally spin it up to check on questions or ideas.)
febeling, 10 days ago
You wonder what all the Rust talk was about before the programming language's release in Jan 2012.
andrewshadura, 11 days ago
Funny nobody's mentioned "correct horse battery staple" in the comments yet…
th1nhng0, 11 days ago
Can I ask how you drew the chart in the post?
pier25, 11 days ago
Would love to see the graph of React, Vue, Angular, and Svelte.
a3w, 11 days ago
Cool project. Cool graphs.

But any GDPR requests for info and deletion in your inbox yet?