
Ask HN: Is there a Hacker News takeout to export my comments / upvotes, etc.?

45 points by thyrox over 1 year ago
Like the title says: wondering if there is an equivalent of Google Takeout for HN? Or how are you guys doing it?

Thanks.
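There's no official takeout, but most of the replies build on the public HN Firebase API (https://github.com/HackerNews/API), which is enough to roll your own export. A minimal sketch — the `/v0/user/` and `/v0/item/` endpoints are the API's real ones; the helper names are ours:

```python
import json
import urllib.request

API = "https://hacker-news.firebaseio.com/v0"

def item_url(item_id):
    """URL of a single item (comment, story, ...) in the official HN API."""
    return f"{API}/item/{item_id}.json"

def user_url(username):
    """URL of a user profile; its 'submitted' field lists the user's item ids."""
    return f"{API}/user/{username}.json"

def fetch_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def export_user(username):
    """Fetch every item a user ever submitted (comments and stories). Network-bound."""
    profile = fetch_json(user_url(username))
    return [fetch_json(item_url(i)) for i in profile.get("submitted", [])]
```

Calling `export_user("thyrox")` would pull everything the profile lists; add your own rate limiting for courtesy.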

8 comments

zX41ZdbW over 1 year ago
You can export the whole dataset as described here: https://github.com/ClickHouse/ClickHouse/issues/29693

Or query one of the preloaded datasets: https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPTSBoYWNrZXJuZXdzIFdIRVJFIGJ5ID0gJ3RoeXJveCcgT1JERVIgQlkgdGltZQ==

```shell
curl https://clickhouse.com/ | sh
./clickhouse client --host play.clickhouse.com --user play --secure \
  --query "SELECT * FROM hackernews WHERE by = 'thyrox' ORDER BY time" \
  --format JSON
```
Jugurtha over 1 year ago
Here's a small, crude Scrapy spider, with hardcoded values and all. You can set the value of `DOWNLOAD_DELAY` in `settings.py` for courtesy. It puts the comments in a `posts` directory as `html` files.

It doesn't do upvotes or stories/links submitted (they have the type `story` in the response, as opposed to `text` for comments). You can easily tweak it.

```python
from pathlib import Path
import scrapy
import requests
import html
import json
import os

USER = 'Jugurtha'
LINKS = f'https://hacker-news.firebaseio.com/v0/user/{USER}.json?print=pretty'
BASE_URL = 'https://hacker-news.firebaseio.com/v0/item/'

class HNSpider(scrapy.Spider):
    name = "hn"

    def start_requests(self):
        submitted = requests.get(LINKS).json()['submitted']
        urls = [f'{BASE_URL}{sub}.json?print=pretty' for sub in submitted]
        for url in urls:
            item = url.split('/item/')[1].split('.json')[0]
            filename = f'{item}.html'
            filepath = Path(f'posts/{filename}')
            if not os.path.exists(filepath):
                yield scrapy.Request(url=url, callback=self.parse)
            else:
                self.log(f'Skipping already downloaded {url}')

    def parse(self, response):
        item = response.url.split('/item/')[1].split('.json')[0]
        filename = f'{item}.html'
        content = json.loads(response.text).get('text')
        if content is not None:
            text = html.unescape(content)
            with open(Path(f'posts/{filename}'), 'w') as f:
                f.write(text)
            self.log(f'Saved file {filename}')
```
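For the "easily tweak it" part, one way to also capture submitted stories — a sketch, not part of the original spider; item shapes follow the official HN API, where comments carry their body in `text` and stories have `title`/`url`:

```python
import html

def render_item(item):
    """Return the HTML to save for an HN item dict, or None to skip it.

    Comments: unescape and return the `text` body.
    Stories: render the submitted link (or bare title for text posts).
    """
    kind = item.get('type')
    if kind == 'comment' and item.get('text'):
        return html.unescape(item['text'])
    if kind == 'story':
        title = html.unescape(item.get('title', ''))
        url = item.get('url', '')
        return f'<a href="{url}">{title}</a>' if url else title
    return None  # polls, jobs, deleted items, ...
```

The spider's `parse` could call this instead of reading `text` directly, so both comments and stories land in `posts/`.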
gabrielsroka over 1 year ago
I wrote a JS one years ago. It still seems to work, but it might need some more throttling.

https://news.ycombinator.com/item?id=34110624

Edit: I see I added a sleep on line 83 a few years ago.

Edit 2: I just fixed a big bug; I'm not sure if it was there before.

Edit 3: I wrote a Python one, too, but I haven't tested it and it most likely needs to be throttled. It's also not currently authenticated, so it's only useful for certain pages unless you add authentication.

https://github.com/gabrielsroka/gabrielsroka.github.io/blob/master/getHNFavorites.py
westurner over 1 year ago
There are few tests for this script, which isn't packaged:
https://github.com/westurner/dlhn/
https://github.com/westurner/dlhn/tree/master/tests
https://github.com/westurner/hnlog/blob/master/Makefile

Ctrl-F of the one document in a browser tab works, but isn't regex search (or `grep -i -C`) without a browser extension.

Dogsheep / datasette has a SQLite query web UI.

HackerNews/API: https://github.com/HackerNews/API
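In the Dogsheep spirit, a minimal sketch of dumping fetched items into SQLite so they can be queried with SQL (or browsed in datasette) instead of Ctrl-F — the schema here is a hypothetical one, not dlhn's or Dogsheep's actual layout:

```python
import sqlite3

FIELDS = ('id', 'type', 'by', 'time', 'text', 'parent')

def save_items(conn, items):
    """Upsert HN item dicts (as returned by the Firebase API) into SQLite."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS items (
               id INTEGER PRIMARY KEY,
               type TEXT, by TEXT, time INTEGER, text TEXT, parent INTEGER
           )"""
    )
    conn.executemany(
        """INSERT OR REPLACE INTO items (id, type, by, time, text, parent)
           VALUES (:id, :type, :by, :time, :text, :parent)""",
        [{k: item.get(k) for k in FIELDS} for item in items],
    )
    conn.commit()
```

After that, `SELECT text FROM items WHERE by = 'thyrox' AND text LIKE '%takeout%'` does what the browser tab can't.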
verdverm over 1 year ago
https://gist.github.com/verdverm/23aefb64ee981e17452e95dd5c491d26

Fetches pages and then converts to JSON.

There might be an HN API now. I know they've wanted one, and I thought I might have seen posts more recently that made me think it now exists, but I haven't looked for it myself.
mooreds over 1 year ago
Nothing out of the box.

There's a copy of the data in BigQuery: https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=hacker_news&t=full&page=table

But the latest post is from Nov 2022; not sure if/when it gets reloaded.
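A sketch of what the query against that table could look like. The table path `bigquery-public-data.hacker_news.full` is read off the console URL above (`p`/`d`/`t` parameters), and the column names are assumptions from the public HN dataset; actually running it needs the `google-cloud-bigquery` client and credentials, shown only as a comment:

```python
def build_export_query(user, table="bigquery-public-data.hacker_news.full"):
    """Build a Standard SQL query pulling one user's items from the public HN table.

    `by` is the item-author column in the HN dataset; it's backtick-quoted
    because BY is a SQL keyword.
    """
    safe_user = user.replace("'", "\\'")  # crude escaping, for illustration only
    return (
        f"SELECT id, type, `by`, timestamp, text, title, url "
        f"FROM `{table}` WHERE `by` = '{safe_user}' ORDER BY timestamp"
    )

# Running it would need credentials, e.g.:
# from google.cloud import bigquery
# rows = bigquery.Client().query(build_export_query("thyrox")).result()
```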
Tomte over 1 year ago
Partially: https://github.com/dogsheep/hacker-news-to-sqlite
082349872349872 over 1 year ago
Scrape https://news.ycombinator.com/user?id=thyrox ?