You can export the whole dataset as described here: <a href="https://github.com/ClickHouse/ClickHouse/issues/29693">https://github.com/ClickHouse/ClickHouse/issues/29693</a><p>Or query one of the preloaded datasets: <a href="https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPTSBoYWNrZXJuZXdzIFdIRVJFIGJ5ID0gJ3RoeXJveCcgT1JERVIgQlkgdGltZQ==" rel="nofollow noreferrer">https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...</a><p><pre><code> curl https://clickhouse.com/ | sh
./clickhouse client --host play.clickhouse.com --user play --secure --query "SELECT * FROM hackernews WHERE by = 'thyrox' ORDER BY time" --format JSON</code></pre>
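If you want to post-process the result in Python, one option is to shell out to the same client invocation and parse its JSON output. A small sketch (ClickHouse's `JSON` output format puts the rows under a top-level `data` key):<p><pre><code> import json
import subprocess

QUERY = "SELECT id, time, text FROM hackernews WHERE by = 'thyrox' ORDER BY time"

# Same client invocation as above, captured and parsed instead of printed.
out = subprocess.run(
    ['./clickhouse', 'client', '--host', 'play.clickhouse.com',
     '--user', 'play', '--secure', '--query', QUERY, '--format', 'JSON'],
    capture_output=True, text=True, check=True,
).stdout

rows = json.loads(out)['data']
print(len(rows), 'comments')</code></pre>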
Here's a small, crude Scrapy spider, with hardcoded values and all. You can set `DOWNLOAD_DELAY` in `settings.py` for courtesy. It puts the comments in a `posts` directory as `html` files.<p>It doesn't handle upvotes or submitted stories/links (their `type` in the response is `story`, whereas comments have type `comment` and carry their body in a `text` field, which is what the spider checks for). You can easily tweak it.<p><pre><code> from pathlib import Path

import scrapy
import requests
import html
import json

USER = 'Jugurtha'
LINKS = f'https://hacker-news.firebaseio.com/v0/user/{USER}.json?print=pretty'
BASE_URL = 'https://hacker-news.firebaseio.com/v0/item/'

class HNSpider(scrapy.Spider):
    name = "hn"

    def start_requests(self):
        # The user's profile lists the ids of everything they submitted.
        Path('posts').mkdir(exist_ok=True)
        submitted = requests.get(LINKS).json()['submitted']
        urls = [f'{BASE_URL}{sub}.json?print=pretty' for sub in submitted]
        for url in urls:
            item = url.split('/item/')[1].split('.json')[0]
            filepath = Path(f'posts/{item}.html')
            if not filepath.exists():
                yield scrapy.Request(url=url, callback=self.parse)
            else:
                self.log(f'Skipping already downloaded {url}')

    def parse(self, response):
        item = response.url.split('/item/')[1].split('.json')[0]
        filepath = Path(f'posts/{item}.html')
        # Stories/links have no 'text' field and are skipped here.
        content = json.loads(response.text).get('text')
        if content is not None:
            text = html.unescape(content)
            with open(filepath, 'w') as f:
                f.write(text)
            self.log(f'Saved file {filepath.name}')</code></pre>
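If you don't want a full Scrapy project just for this, a minimal way to run the spider above from a plain script (a sketch; the `DOWNLOAD_DELAY` here replaces the `settings.py` entry mentioned earlier):<p><pre><code> from scrapy.crawler import CrawlerProcess

# Run HNSpider without a project; one second between requests for courtesy.
process = CrawlerProcess(settings={'DOWNLOAD_DELAY': 1.0})
process.crawl(HNSpider)
process.start()  # blocks until the crawl finishes</code></pre>
Alternatively, `scrapy runspider spider.py -s DOWNLOAD_DELAY=1` does the same from the command line.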
I wrote a JS one years ago. It still seems to work, but it might need some more throttling.<p><a href="https://news.ycombinator.com/item?id=34110624">https://news.ycombinator.com/item?id=34110624</a><p>Edit: I see I added a sleep on line 83 a few years ago.<p>Edit 2: I just fixed a big bug; I'm not sure if it was there before.<p>Edit 3: I wrote a Python one, too, but I haven't tested it and it most likely needs to be throttled. It's also not currently authenticated, so it's only useful for pages that don't require login unless you add authentication.<p><a href="https://github.com/gabrielsroka/gabrielsroka.github.io/blob/master/getHNFavorites.py">https://github.com/gabrielsroka/gabrielsroka.github.io/blob/...</a>
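The core of that kind of scraper is just paging through the public favorites listing with a delay between requests. A rough sketch in the same spirit, not the linked script itself (the username and delay are placeholders):<p><pre><code> import re
import time
import requests

USER = 'pg'   # placeholder username
DELAY = 2     # seconds between page fetches, for throttling

item_ids = []
page = 1
while True:
    # The public favorites listing needs no authentication.
    url = f'https://news.ycombinator.com/favorites?id={USER}&p={page}'
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Every favorited story links to item?id=<n>; dedupe while keeping order.
    ids = list(dict.fromkeys(re.findall(r'item\?id=(\d+)', resp.text)))
    if not ids:
        break
    item_ids.extend(ids)
    page += 1
    time.sleep(DELAY)

print(item_ids)</code></pre>
The same loop with a session cookie would cover the pages that do need login (upvoted submissions, etc.).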
There are a few tests for this script, which isn't packaged: <a href="https://github.com/westurner/dlhn/">https://github.com/westurner/dlhn/</a> <a href="https://github.com/westurner/dlhn/tree/master/tests">https://github.com/westurner/dlhn/tree/master/tests</a> <a href="https://github.com/westurner/hnlog/blob/master/Makefile">https://github.com/westurner/hnlog/blob/master/Makefile</a><p>Ctrl-F over the one big document in a browser tab works, but it isn't regex search (or `grep -i -C`) without a browser extension; a small Python equivalent is sketched after the links below.<p>Dogsheep / datasette has a SQLite query web UI.<p>HackerNews/API:
<a href="https://github.com/HackerNews/API">https://github.com/HackerNews/API</a>
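For the regex-with-context idea, something along these lines reproduces `grep -i -C` over the single downloaded document (a sketch; the filename is a placeholder, not part of dlhn):<p><pre><code> import re
import sys

PATTERN = sys.argv[1]      # e.g. 'clickhouse'
PATH = 'comments.html'     # placeholder: the one big downloaded document
CONTEXT = 2                # lines of context on each side, like -C 2

lines = open(PATH, encoding='utf-8').read().splitlines()
regex = re.compile(PATTERN, re.IGNORECASE)   # case-insensitive, like -i

for i, line in enumerate(lines):
    if regex.search(line):
        lo, hi = max(0, i - CONTEXT), min(len(lines), i + CONTEXT + 1)
        for j in range(lo, hi):
            print(f'{j + 1}: {lines[j]}')
        print('--')</code></pre>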
<a href="https://gist.github.com/verdverm/23aefb64ee981e17452e95dd5c491d26" rel="nofollow noreferrer">https://gist.github.com/verdverm/23aefb64ee981e17452e95dd5c4...</a><p>Fetches pages and then converts them to JSON.<p>There might be an HN API now. I know they've wanted one, and I thought I'd seen posts more recently that made me think it now exists, but I haven't looked for it myself.
Nothing out of the box.<p>There's a copy of the data in BigQuery: <a href="https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=hacker_news&t=full&page=table" rel="nofollow noreferrer">https://console.cloud.google.com/bigquery?p=bigquery-public-...</a><p>But the latest post is from Nov 2022; not sure if/when it gets reloaded.
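Querying that public table from Python looks roughly like this (a sketch, assuming the `google-cloud-bigquery` client and a GCP project with billing enabled; the username is a placeholder):<p><pre><code> from google.cloud import bigquery

client = bigquery.Client()  # picks up your default GCP credentials/project

# `by` is a reserved word in BigQuery SQL, hence the backticks around it.
query = """
    SELECT id, time, text
    FROM `bigquery-public-data.hacker_news.full`
    WHERE `by` = 'thyrox' AND type = 'comment'
    ORDER BY time
"""

for row in client.query(query).result():
    print(row.id, row.time, row.text)</code></pre>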