As someone who has been aggressively cataloging "data" (posts, comments, subreddits, etc.) from Reddit and, importantly in this context, keeping those records relatively up to date, I find it absolutely astonishing how much spam there is.<p>I hash every string with SimHash, then run a Hamming distance query over those hashes to find any hash shared by more than 3 accounts, i.e., any full string (> 42 characters) that was posted as a post title, post body, comment body, or account "description" by more than 3 accounts.<p>This regularly exposes huge networks of both fresh accounts and what I have to assume are stolen, credentialed "aged" accounts, all recycling the same or very similar titles/bodies (Hamming distance < 5 on strings > 42 characters). We're talking thousands of accounts over months, posting the same content over and over to the same range of subreddits.<p>I'm just some random Laravel enjoyer, and I've automated the "banning" of these accounts (really, I flag the strings, and any account that posts one of them is then flagged).<p>This doesn't even touch on the media. I've done basically the same thing there, hashing the media with pHash to detect duplicate or very, very similar content, and thousands and thousands of accounts are spamming the same images over and over and over.<p>From my numbers, 59% of the content on Reddit is spam and 51% of the accounts are spam, and that's not including the media-flagged spammers.<p>Reddit either doesn't seem to care about the spam or is completely inept. With the resources at their disposal, a huge portion of this should be able to be moderated before it ever reaches the API or the live site.
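<p>For anyone who wants to try the text side of this, here is a minimal sketch of the idea in Python (not my actual Laravel code). The thresholds (42 characters, distance < 5, more than 3 accounts) are the ones described above. Everything else is an assumption made for illustration: the word 3-gram shingling, MD5 as the per-shingle hash, the greedy linear scan over clusters, and the names `simhash`, `hamming`, and `flag_accounts`.

```python
import hashlib
import re

MIN_LENGTH = 42    # only strings longer than this are considered
MAX_DISTANCE = 5   # fingerprints with distance below this count as "very similar"
MIN_ACCOUNTS = 3   # a string is spam once more than this many accounts posted it
BITS = 64          # fingerprint width (assumed)


def simhash(text: str) -> int:
    """64-bit SimHash over lowercase word 3-grams (shingling is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    weights = [0] * BITS
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes sum to a positive value.
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def flag_accounts(posts):
    """posts: iterable of (account, text). Returns the set of flagged accounts."""
    clusters = []  # list of (representative fingerprint, set of accounts)
    for account, text in posts:
        if len(text) <= MIN_LENGTH:
            continue
        fp = simhash(text)
        for rep, accounts in clusters:
            if hamming(rep, fp) < MAX_DISTANCE:
                accounts.add(account)
                break
        else:
            # No sufficiently close cluster yet, so start a new one.
            clusters.append((fp, {account}))
    flagged = set()
    for _, accounts in clusters:
        if len(accounts) > MIN_ACCOUNTS:
            flagged |= accounts
    return flagged
```

Calling `flag_accounts` with `(account, text)` pairs returns every account that belongs to a cluster shared by more than 3 accounts. The linear scan is only workable for small batches; at real volume the distance lookup would need an index, and that part is not shown here.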