Ask HN: Reddit wins? no current and complete Reddit Corpus dataset exists

10 points by moonshotideas almost 2 years ago
Is it me, or did Reddit just win? I don’t think a full dataset of all posts and comments exists other than the 2018 one, and now that a large chunk of Reddit has gone dark, no one can scrape one together. How is this not all playing right into Reddit’s hands?

5 comments

jakabia almost 2 years ago
All of Reddit (posts and comments separately) from 2005-06 until 2022-12 is available via this torrent [1], and it's very easy to download, extract and use the data [2]. I'm writing my thesis on the connection between a Reddit post's type and its comment structure, and I've been working with this data for a few months. It's amazing.

[1] https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee

[2] https://github.com/Watchful1/PushshiftDumps
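For readers wondering what "using the data" might look like, here is a minimal sketch, not taken from the linked repository. It assumes the dump files, once decompressed, are newline-delimited JSON with one object per post or comment, and that each object has a `subreddit` field. The decompression step is left out: the dumps are zstd-compressed, so in practice a decompressing reader (for example from the third-party `zstandard` package) would have to be passed in where the in-memory sample is used below. The function names `iter_records` and `count_by_subreddit` are hypothetical.

```python
import io
import json
from collections import Counter
from typing import BinaryIO, Iterator


def iter_records(stream: BinaryIO) -> Iterator[dict]:
    """Yield one dict per valid JSON object line in a binary stream."""
    text = io.TextIOWrapper(stream, encoding="utf-8", errors="replace")
    for raw in text:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict):
            yield record


def count_by_subreddit(stream: BinaryIO) -> Counter:
    """Count records per subreddit (field name is an assumption)."""
    counts: Counter = Counter()
    for record in iter_records(stream):
        counts[record.get("subreddit", "<unknown>")] += 1
    return counts


# Tiny in-memory stand-in for an already decompressed dump file.
sample = io.BytesIO(
    b'{"subreddit": "DataHoarder", "body": "first"}\n'
    b'{"subreddit": "AskReddit", "body": "second"}\n'
    b'{"subreddit": "DataHoarder", "body": "third"}\n'
)
print(count_by_subreddit(sample).most_common())
```

Reading line by line keeps memory use flat, which matters because the real files are far too large to load at once.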
seydor almost 2 years ago
It is so interesting to consider that Reddit, the unprofitable startup that everyone loves to mock, managed to have the most valuable dataset among all social media. I am certain that despite Facebook being 10 times its size, its comments are no match for the information contained in Reddit's comments. Why did this happen? Greed, I guess. Popular doesn't mean valuable.

In any case, I am rooting for Reddit to win big, but I don't see them having a plan. Their website is stuck in '00s norms while the world has moved forward. Now a clique of moderators has taken over the site, and Reddit doesn't seem to be doing anything about it. So many lost opportunities.
pyeri almost 2 years ago
You're right. Call me a pessimistic tinfoil hat, but there is a good chance that Reddit Inc. is playing both sides here. This is a tactic employed by many authoritarians of our times, one I'd like to call "giving your enemy that extra rope to hang itself".

It's no wonder that public sympathy is strongly shifting towards the side of spez and Reddit Inc. after all the major subs went dark all of a sudden. The concept of an "indefinite blackout" was problematic to begin with. Reddit blackouts have happened before, when net neutrality was in danger or freedom-curbing laws were being passed, but they used to last just a day or two to garner attention.

The impression netizens are getting right now is that these "rogue mods" have just hijacked the subreddits and disappeared, bringing the conversations and the whole ecosystem to a standstill. How exactly is this perception not working in favor of Reddit and spez? As I said, giving your enemy the extra rope to hang itself!
kratom_sandwich almost 2 years ago
IIRC, the Archive Team asked for help on /r/DataHoarder and had saved 10 bn posts to later upload them to the Internet Archive ... that might yield something? I personally had never heard of them before, but that doesn't mean anything ...

https://wiki.archiveteam.org/index.php/Reddit#Project_details
firsal_ha almost 2 years ago
I just did a couple of searches and couldn't find any. Does anyone else know of a full Reddit dataset? This can't be true.