Ask HN: Reddit wins? no current and complete Reddit Corpus dataset exists

10 points by moonshotideas, almost 2 years ago
Is it just me, or did Reddit just win? I don't think a full dataset of all posts and comments exists other than the 2018 one, and now that a large chunk of Reddit has gone dark, no one can scrape one together. How is this not all playing right into Reddit's hands?

5 comments

jakabia, almost 2 years ago
The whole of Reddit (posts and comments separately) from 2005-06 through 2022-12 is available at this torrent link [1], and it's very easy to download, extract, and use the data [2]. I'm writing my thesis on the connection between a Reddit post's type and its comment structure, and I've been working with this data for a few months. It's amazing.

[1] https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee

[2] https://github.com/Watchful1/PushshiftDumps
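As a rough illustration of what "using the data" could look like, here is a minimal sketch that counts records per subreddit. It assumes that each dump file, once decompressed, holds one JSON object per line, and that each object has a `subreddit` field; both are assumptions not stated in the comment above, and the function name and sample records are hypothetical. Decompression is left out, so this covers only the parsing step.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_subreddit(lines: Iterable[str]) -> Counter:
    """Count records per subreddit from newline-delimited JSON text lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort a long run
        if not isinstance(record, dict):
            continue  # only JSON objects are treated as records
        # "subreddit" is an assumed field name; adjust to the real schema.
        counts[record.get("subreddit", "<unknown>")] += 1
    return counts


# Hypothetical sample records standing in for a decompressed dump file.
SAMPLE_LINES = [
    '{"subreddit": "DataHoarder", "author": "a", "body": "first"}',
    '{"subreddit": "DataHoarder", "author": "b", "body": "second"}',
    '{"subreddit": "python", "author": "c", "body": "third"}',
    "not json",
]

if __name__ == "__main__":
    for name, n in count_by_subreddit(SAMPLE_LINES).most_common():
        print(name, n)
```

Because the function accepts any iterable of strings, an open text file handle can be passed in directly, which keeps memory use flat even for very large files.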
seydor, almost 2 years ago
It is so interesting to consider that Reddit, the unprofitable startup that everyone loves to mock, managed to have the most valuable dataset among all social media. I am certain that despite Facebook being 10 times its size, its comments are no match for the information contained in Reddit's comments. Why did this happen? Greed, I guess. Popular doesn't mean valuable.

In any case I am rooting for Reddit to win big, but I don't see them having a plan. Their website is stuck in '00s norms while the world has moved forward. Now a clique of moderators has taken over the site, and Reddit doesn't seem to do anything about it. So many lost opportunities.
pyeri, almost 2 years ago
You're right. Call me a pessimistic tinfoil hat, but there is a good chance that Reddit Inc. is playing both sides here. This has been a tactic employed by many authoritarians of our times, one I'd like to call "giving your enemy that extra rope to hang itself".

It's no wonder that public sympathy is strongly shifting towards spez and Reddit Inc. after all the major subs went dark all of a sudden. The concept of an "indefinite blackout" was problematic to begin with. Reddit blackouts had happened earlier too, when net neutrality was in danger or freedom-curbing laws were being passed, but they used to last just a day or two to garner attention.

The impression netizens are getting right now is that these "rogue mods" have just hijacked the subreddits and disappeared, bringing whole conversations and the ecosystem to a standstill. How exactly is this perception not working in favor of Reddit and spez? As I said, giving your enemy the extra rope to hang itself!
kratom_sandwich, almost 2 years ago
IIRC, the Archive Team asked for help on /r/DataHoarder and had saved 10 bn posts to upload later to the Internet Archive... that might yield something? I personally had never heard of them before, but that doesn't mean anything...

https://wiki.archiveteam.org/index.php/Reddit#Project_details
firsal_ha, almost 2 years ago
I just did a couple of searches and couldn't find any. Does anyone else know of a full Reddit dataset? This can't be true?