TechEcho
Ask HN: Is anyone working on a Reddit archive?

22 points by nidnogg almost 2 years ago

Although I disagree with most of the criticisms leveled at the platform, I still depend on it for a lot of day-to-day resources and general questions, many of them not very technical.

While I wouldn't mind losing the system there in the long run, I think the state of posts before this upheaval was very valuable as a reference.

Like the title says: has anyone done anything like a Reddit "takeout" yet?

4 comments

uniqueuid almost 2 years ago

Yes.

There is the Pushshift dataset covering posts and comments through 2022 [1].

The ArchiveTeam also began crawling Reddit some time ago [2].

[1] https://old.reddit.com/r/pushshift/comments/10bwxke/updated_torrent_of_dump_files_through_december/
[2] https://news.ycombinator.com/item?id=36254172

cookiengineer almost 2 years ago

I was focussing mostly on cyber security related subreddits because the vulnerability and exploit discussions were of great value to me.

I built a little scraper in golang that stores the JSON data (instead of the HTML which the archive warrior stores) to save hdd storage. [1]

The problem with Reddit's API is that every listing only shows 1000 entries over 10 pages. That means hot/top/new and search results are all limited. If there are more links related to a keyword than that, you won't discover the rest.

So you need a very specific keyword list to be able to discover more posts, and you have to search each subreddit for each entry in the keyword list.

[1] https://github.com/cookiengineer/reddit-archivar

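The per-subreddit, per-keyword search described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the linked golang scraper: the function names `search_urls` and `merge_posts`, the subreddit and keyword values, and the page size of 100 (inferred from "1000 entries over 10 pages") are all assumptions, and the sketch only builds URLs and deduplicates results without making any network requests.

```python
from itertools import product
from urllib.parse import urlencode

LISTING_CAP = 1000  # per-listing cap mentioned in the comment
PAGE_SIZE = 100     # assumption: 10 pages x 100 entries per page


def search_urls(subreddits, keywords):
    """Yield one search URL per (subreddit, keyword) pair (hypothetical helper)."""
    for sub, kw in product(subreddits, keywords):
        # restrict_sr=1 keeps the search inside the given subreddit
        query = urlencode(
            {"q": kw, "restrict_sr": 1, "limit": PAGE_SIZE, "sort": "new"}
        )
        yield f"https://www.reddit.com/r/{sub}/search.json?{query}"


def merge_posts(archive, posts):
    """Add posts to the archive keyed by id and return how many were new.

    Different keywords often return overlapping posts, so deduplicating
    by id keeps the archive from storing the same post twice.
    """
    added = 0
    for post in posts:
        if post["id"] not in archive:
            archive[post["id"]] = post
            added += 1
    return added


if __name__ == "__main__":
    # Example values only; a real keyword list would be much longer.
    for url in search_urls(["netsec"], ["CVE", "exploit"]):
        print(url)
```

Each search is still capped like any other listing, so a narrower keyword list per subreddit surfaces more distinct posts than a few broad terms would.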
minimaxir almost 2 years ago

Pushshift was the Reddit archive but apparently recent agreements with Reddit may have changed that.

Anyone else creating a Reddit archive will likely get a C&D.

simonblack almost 2 years ago

It's next on my list after I finish the MySpace archive.

Seriously, why would anybody do this? Reddit has such a high noise-to-signal ratio that it would be a waste of resources. There may be value in keeping an archive of some individual subreddits, but not the main bulk of Reddit itself.