> The other 95% of it is just wasted bandwidth.<p>You can save a lot of bandwidth by requesting compressed responses:<p><pre><code> $ curl -s --user-agent moo/1 -H 'Accept-Encoding: gzip' "$pretty_long_url" > test.gz
$ wc -c < test.gz
63507
$ gzip -d < test.gz | wc -c
426941
</code></pre>
(OK, that's 85% saved, not 95%, but hey.)
> I mean do you really want subreddit name and subreddit_name_prefixed? They’re the same, one just has an “r/” in front of it.<p>This is (unfortunately) not quite true. Since Reddit introduced "profile posts," there can be a post where the subreddit name is something like "u_Shitty_Watercolour" but the subreddit_name_prefixed is actually "u/Shitty_Watercolour", rather than "r/u_Shitty_Watercolour".<p>Example: <a href="https://www.reddit.com/user/Shitty_Watercolour/comments/84nhwi/here_is_my_patreon_if_you_would_like_to_support/.json" rel="nofollow">https://www.reddit.com/user/Shitty_Watercolour/comments/84nh...</a>
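<p>So if you drop the prefixed field to save bytes, you have to rebuild it with that special case in mind. A minimal sketch in Python (the function name is mine):<p><pre><code>def prefixed_name(subreddit):
    # Profile posts live in pseudo-subreddits named "u_" + username,
    # which Reddit prefixes as "u/" + username, not "r/u_" + username.
    if subreddit.startswith("u_"):
        return "u/" + subreddit[2:]
    return "r/" + subreddit

assert prefixed_name("programming") == "r/programming"
assert prefixed_name("u_Shitty_Watercolour") == "u/Shitty_Watercolour"
</code></pre>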
It is difficult for me to describe just how angry it makes me that reddit doesn't provide a way for users to even do basic things like "see all of my own comments" or "see all of the posts made to the subreddit I moderate". They keep nerfing the search APIs and claim it is so they can make the indexes more efficient, but while that might make sense for a full-text search interface, it is entirely unreasonable for basic functionality like "I'm scrolling back through time on my own user page" (where the efficient index is pretty obvious). Both "see all of the content I posted" and "see all of the content I'm supposedly responsible for" seem like they should be basic, if not required, functionality for any website.<p><a href="https://www.reddit.com/r/changelog/comments/7tus5f/update_to_search_api/" rel="nofollow">https://www.reddit.com/r/changelog/comments/7tus5f/update_to...</a><p><a href="https://www.reddit.com/r/redditdev/comments/7qpn0h/how_to_retrieve_all_removed_posts_via_api/" rel="nofollow">https://www.reddit.com/r/redditdev/comments/7qpn0h/how_to_re...</a><p><a href="https://www.reddit.com/r/help/comments/1u0scj/get_full_post_history/" rel="nofollow">https://www.reddit.com/r/help/comments/1u0scj/get_full_post_...</a>
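<p>For reference, "scrolling back through time on my own user page" via the listing API looks roughly like this sketch (Python, not authoritative); the problem is that this loop silently runs dry after about 1,000 items, no matter how long your history is:<p><pre><code>import requests, time

def user_comments(username):
    after = None
    while True:
        r = requests.get(
            "https://www.reddit.com/user/%s/comments.json" % username,
            params={"limit": 100, "after": after},
            headers={"User-Agent": "history-dump/0.1"})
        r.raise_for_status()
        data = r.json()["data"]
        for child in data["children"]:
            yield child["data"]
        after = data["after"]
        if after is None:  # Reddit has stopped paginating
            return
        time.sleep(2)      # stay under the rate limit
</code></pre>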
"You may think PHP is slow"<p>Why would we think php is slow? PHP is blazing fast certain applications (looking at you sugarcrm) make this into a mockery by rewriting queries and loading unnecessary data into each page request.<p>Nice to see a php related show and tell.
Related: Jason Baumgartner has maintained a Reddit scraping pipeline for a few years now, and wrote up some notes about making it robust: <a href="https://pushshift.io" rel="nofollow">https://pushshift.io</a>
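<p>For the curious, pushshift also exposes a public search API; something like the following (endpoint and response shape as documented on the site, so treat it as a sketch) pulls recent comments matching a keyword:<p><pre><code>import requests

r = requests.get("https://api.pushshift.io/reddit/search/comment/",
                 params={"q": "f5bot", "size": 25})
for c in r.json()["data"]:
    print(c["author"], c["body"][:60])
</code></pre>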
Aho-Corasick is really great. It’s a bit complicated to set up, but once you have the modified trie set up it’s really fast. By the way,<p>> Basically I use the selftext, subreddit, permalink, url and title. The other 95% of it is just wasted bandwidth.<p>It’d probably be better for Reddit, too, if they allowed clients to specify the fields they care about rather than always returning the whole thing…
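<p>If you'd rather not hand-roll the trie and failure links, the pyahocorasick library does the heavy lifting; a minimal sketch:<p><pre><code>import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for word in ["php", "reddit", "f5bot"]:
    automaton.add_word(word, word)
automaton.make_automaton()  # builds the failure links

text = "f5bot scans reddit comments for your keywords"
for end_index, word in automaton.iter(text):
    print(word, "ends at", end_index)
</code></pre>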
This is just scraping JSON; I'm surprised it made it to the front page. The only thing worth noting is that Reddit is able to <i>serve</i> that much JSON.
I made something just like this that worked on forums: you could subscribe to basically any forum that was using the Tapatalk plugin (pretty much any busy forum uses it these days). It doesn't look like this will handle misspellings of words, or anything like that. I was handling that; however, it took a LOT of processing power, and I quickly realized that the more people used it, the worse it was going to scale. Good luck with your project.
> <i>So here’s the approach I ended up using, which worked much better: request each post by its ID. That’s right, instead of asking for posts in batches of 100, we’re going to need to ask for each post individually by its post ID. We’ll do the same for comments.</i><p>Seems a bit over the top imho. Maybe a better approach is to ask for 1,000 and look for any missing — which you can grab individually.<p>I’d be a little annoyed at people not using batch mode and making so many requests, but that’s just me.
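<p>Concretely, that hybrid could look like the sketch below: pull the normal /new listing, find holes in the (roughly sequential) base-36 IDs, and fetch just those via /api/info, which accepts up to 100 fullnames per call. The endpoints are real; the gap logic is illustrative:<p><pre><code>import requests

UA = {"User-Agent": "gap-filler/0.1"}
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def base36(n):
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out or "0"

listing = requests.get("https://www.reddit.com/r/all/new.json",
                       params={"limit": 100}, headers=UA).json()
seen = {int(c["data"]["id"], 36) for c in listing["data"]["children"]}
missing = [base36(n) for n in range(min(seen), max(seen)) if n not in seen]

# /api/info looks things up by fullname ("t3_" prefix for posts)
info = requests.get("https://api.reddit.com/api/info",
                    params={"id": ",".join("t3_" + m for m in missing[:100])},
                    headers=UA).json()
</code></pre>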
There's a Reddit database dump covering 2005 through May 2018 at:<p><a href="https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05" rel="nofollow">https://bigquery.cloud.google.com/table/fh-bigquery:reddit_c...</a>
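<p>A hedged example of querying it from Python with the BigQuery client library (the table name comes from the link above; I'm assuming the comment tables expose a "subreddit" column, as the public dumps did):<p><pre><code>from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT subreddit, COUNT(*) AS n
    FROM `fh-bigquery.reddit_comments.2015_05`
    GROUP BY subreddit
    ORDER BY n DESC
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.subreddit, row.n)
</code></pre>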
Which API do most Reddit bots use? Do they use the Reddit APIs directly, or do they use one of the third-party services (F5Bot, pushshift)? And are there any other options for getting a firehose of new Reddit posts/comments?
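<p>One DIY option is to poll the sitewide /new listing yourself and dedupe on fullname; a rough "poor man's firehose" sketch in Python:<p><pre><code>import requests, time

UA = {"User-Agent": "firehose-demo/0.1"}
seen = set()  # grows without bound in this sketch

while True:
    r = requests.get("https://www.reddit.com/r/all/new.json",
                     params={"limit": 100}, headers=UA)
    for child in r.json()["data"]["children"]:
        name = child["data"]["name"]  # fullname, e.g. "t3_84nhwi"
        if name not in seen:
            seen.add(name)
            print(child["data"]["title"])
    time.sleep(2)  # respect the rate limit
</code></pre>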
"Turns out that Reddit [API] has a limit. It'll only show you 100 posts at a time."<p>100 sounds like a typical "max-requests" pipelining limit.<p>He does not mention CURLMOPT_PIPELINING.<p>Does this mean he makes 100 TCP connections in order to make 100 HTTP requests?
This is mostly why I left Reddit. The API allows far too much control, and I started questioning what was even real. Being able to quickly find keywords and then have a network of bots create replies/upvotes/downvotes is a very disturbing thought to me. I can't even imagine something like that being used on a large scale to change opinions.