> The other 95% of it is just wasted bandwidth.<p>You can save a lot of bandwidth by requesting compressed responses:<p><pre><code> $ curl -s --user-agent moo/1 -H 'Accept-Encoding: gzip' "$pretty_long_url" > test.gz
$ wc -c < test.gz
63507
$ gzip -d < test.gz | wc -c
426941
</code></pre>
(OK, that's 85% saved, not 95%, but hey.)
> I mean do you really want subreddit name and subreddit_name_prefixed? They’re the same, one just has an “r/” in front of it.<p>This is (unfortunately) not quite true. Since Reddit introduced "profile posts," there can be a post where the subreddit name is something like "u_Shitty_Watercolour" but the subreddit_name_prefixed is actually "u/Shitty_Watercolour", rather than "r/u_Shitty_Watercolour".<p>Example: <a href="https://www.reddit.com/user/Shitty_Watercolour/comments/84nhwi/here_is_my_patreon_if_you_would_like_to_support/.json" rel="nofollow">https://www.reddit.com/user/Shitty_Watercolour/comments/84nh...</a>
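<p>So if you drop the prefixed field to save bytes, you have to rebuild it with that special case in mind. A minimal sketch in Python (the function name is mine):<p><pre><code>def prefixed_name(subreddit):
    # Profile posts live in pseudo-subreddits named "u_" + username,
    # which Reddit prefixes as "u/" + username, not "r/u_" + username.
    if subreddit.startswith("u_"):
        return "u/" + subreddit[2:]
    return "r/" + subreddit

assert prefixed_name("programming") == "r/programming"
assert prefixed_name("u_Shitty_Watercolour") == "u/Shitty_Watercolour"
</code></pre>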
It is difficult for me to describe just how angry it makes me that reddit doesn't provide a way for users to even do basic things like "see all of my own comments" or "see all of the posts made to the subreddit I moderate". They keep nerfing the search APIs and claim it is so they can make the indexes more efficient, but while that might make sense for a full-text search interface, it is entirely unreasonable for basic functionality like "I'm scrolling back through time on my own user page" (where the efficient index is pretty obvious). Both "see all of the content I posted" and "see all of the content I'm supposedly responsible for" seem like they should be basic, if not required, functionality for any website.<p><a href="https://www.reddit.com/r/changelog/comments/7tus5f/update_to_search_api/" rel="nofollow">https://www.reddit.com/r/changelog/comments/7tus5f/update_to...</a><p><a href="https://www.reddit.com/r/redditdev/comments/7qpn0h/how_to_retrieve_all_removed_posts_via_api/" rel="nofollow">https://www.reddit.com/r/redditdev/comments/7qpn0h/how_to_re...</a><p><a href="https://www.reddit.com/r/help/comments/1u0scj/get_full_post_history/" rel="nofollow">https://www.reddit.com/r/help/comments/1u0scj/get_full_post_...</a>
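<p>For reference, "scrolling back through time on my own user page" via the listing API looks roughly like this sketch (Python, not authoritative); the problem is that this loop silently runs dry after about 1,000 items, no matter how long your history is:<p><pre><code>import requests, time

def user_comments(username):
    after = None
    while True:
        r = requests.get(
            "https://www.reddit.com/user/%s/comments.json" % username,
            params={"limit": 100, "after": after},
            headers={"User-Agent": "history-dump/0.1"})
        r.raise_for_status()
        data = r.json()["data"]
        for child in data["children"]:
            yield child["data"]
        after = data["after"]
        if after is None:  # Reddit has stopped paginating
            return
        time.sleep(2)      # stay under the rate limit
</code></pre>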
"You may think PHP is slow"<p>Why would we think php is slow? PHP is blazing fast certain applications (looking at you sugarcrm) make this into a mockery by rewriting queries and loading unnecessary data into each page request.<p>Nice to see a php related show and tell.
Related: Jason Baumgartner has maintained a Reddit scraping pipeline for a few years now, and wrote up some notes about making it robust: <a href="https://pushshift.io" rel="nofollow">https://pushshift.io</a>
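<p>For the curious, pushshift also exposes a public search API; something like the following (endpoint and response shape as documented on the site, so treat it as a sketch) pulls recent comments matching a keyword:<p><pre><code>import requests

r = requests.get("https://api.pushshift.io/reddit/search/comment/",
                 params={"q": "f5bot", "size": 25})
for c in r.json()["data"]:
    print(c["author"], c["body"][:60])
</code></pre>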
Aho-Corasick is really great. It’s a bit complicated to set up, but once you have the modified trie set up it’s really fast. By the way,<p>> Basically I use the selftext, subreddit, permalink, url and title. The other 95% of it is just wasted bandwidth.<p>It’d probably be better for Reddit, too, if they allowed clients to specify the fields they care about rather than always returning the whole thing…
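<p>If you'd rather not hand-roll the trie and failure links, the pyahocorasick library does the heavy lifting; a minimal sketch:<p><pre><code>import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for word in ["php", "reddit", "f5bot"]:
    automaton.add_word(word, word)
automaton.make_automaton()  # builds the failure links

text = "f5bot scans reddit comments for your keywords"
for end_index, word in automaton.iter(text):
    print(word, "ends at", end_index)
</code></pre>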
This is just scraping JSON; I'm surprised it made it to the front page. The only thing worth noting is that Reddit is able to <i>serve</i> that much JSON.
I made something just like this that worked on forums: you could subscribe to basically any forum that was using the Tapatalk plugin (pretty much any busy forum uses it these days). It doesn't look like this will handle misspellings of words, or anything like that. I was handling that; however, it took a LOT of processing power, and I quickly realized that the more people used it, the worse it was going to scale. Good luck with your project.
> <i>So here’s the approach I ended up using, which worked much better: request each post by its ID. That’s right, instead of asking for posts in batches of 100, we’re going to need to ask for each post individually by its post ID. We’ll do the same for comments.</i><p>Seems a bit over the top imho. Maybe a better approach is to ask for 1,000 and look for any missing — which you can grab individually.<p>I’d be a little annoyed at people not using batch mode and making so many requests, but that’s just me.
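<p>Concretely, that hybrid could look like the sketch below: pull the normal /new listing, find holes in the (roughly sequential) base-36 IDs, and fetch just those via /api/info, which accepts up to 100 fullnames per call. The endpoints are real; the gap logic is illustrative:<p><pre><code>import requests

UA = {"User-Agent": "gap-filler/0.1"}
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def base36(n):
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out or "0"

listing = requests.get("https://www.reddit.com/r/all/new.json",
                       params={"limit": 100}, headers=UA).json()
seen = {int(c["data"]["id"], 36) for c in listing["data"]["children"]}
missing = [base36(n) for n in range(min(seen), max(seen)) if n not in seen]

# /api/info looks things up by fullname ("t3_" prefix for posts)
info = requests.get("https://api.reddit.com/api/info",
                    params={"id": ",".join("t3_" + m for m in missing[:100])},
                    headers=UA).json()
</code></pre>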
There's a Reddit database dump covering 2005 through May 2018 at:<p><a href="https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05" rel="nofollow">https://bigquery.cloud.google.com/table/fh-bigquery:reddit_c...</a>
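<p>A hedged example of querying it from Python with the BigQuery client library (the table name comes from the link above; I'm assuming the comment tables expose a "subreddit" column, as the public dumps did):<p><pre><code>from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT subreddit, COUNT(*) AS n
    FROM `fh-bigquery.reddit_comments.2015_05`
    GROUP BY subreddit
    ORDER BY n DESC
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.subreddit, row.n)
</code></pre>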
Which API do most Reddit bots use? Do they use the Reddit APIs directly, or do they use one of the third-party services (F5Bot, pushshift)? And are there any other options for getting a firehose of new Reddit posts/comments?
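<p>One DIY option is to poll the sitewide /new listing yourself and dedupe on fullname; a rough "poor man's firehose" sketch in Python:<p><pre><code>import requests, time

UA = {"User-Agent": "firehose-demo/0.1"}
seen = set()  # grows without bound in this sketch

while True:
    r = requests.get("https://www.reddit.com/r/all/new.json",
                     params={"limit": 100}, headers=UA)
    for child in r.json()["data"]["children"]:
        name = child["data"]["name"]  # fullname, e.g. "t3_84nhwi"
        if name not in seen:
            seen.add(name)
            print(child["data"]["title"])
    time.sleep(2)  # respect the rate limit
</code></pre>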
"Turns out that Reddit [API] has a limit. It'll only show you 100 posts at a time."<p>100 sounds like a typical "max-requests" pipelining limit.<p>He does not mention CURLMOPT_PIPELINING.<p>Does this mean he makes 100 TCP connections in order to make 100 HTTP requests?
This is mostly why I left Reddit. The API allows far too much control, and I started questioning what was even real. Being able to quickly find keywords and then have a network of bots create replies/upvotes/downvotes is a very disturbing thought to me. I can't even imagine something like that being used on a large scale to change opinions.