A friend of mine co-runs a semi-popular, semi-niche news site (for more than a decade now), and complains that traffic recently rose due to bots masquerading as humans.<p>How would they know? Well, because Google, in its omniscience, started to downrank them for faking views with bots (which they do not do): it shows the bot percentage in traffic stats, and it skyrocketed relative to non-bot traffic (which is now less than 50%) as they started to fall from the front page (feeding the vicious circle). Presumably, Google does not know or care that it is a bot when it serves ads, but correlates it later with the metrics it has from other sites that use GA or ads.<p>Or, perhaps, Google spots the same anomalies that my friend (an old-school sysadmin who pays attention to logs) did, such as the increase in traffic along with a never-before-seen popularity among iPhone users (who are so tech-savvy that they apparently do not require CSS), or users from Dallas who famously love their QQBrowser. I’m not going to list all the telltale signs, as the crowd here is too hyped on LLMs (which is our going theory so far; it is very timely), but my friend hopes Google learns them quickly.<p>These newcomers usually fake the UA, use inconspicuous Western IPs (requests from Baidu/Tencent data center ranges do identify themselves as bots in the UA), ignore robots.txt, and load many pages very quickly.<p>I would assume the bot traffic increase applies to feeds as well, since they are of as much use for LLM training purposes.<p>My friend does not actually engage in stringent filtering like Rachel does, but I wonder how soon it becomes actually infeasible to operate a website <i>with actual original content</i> (which my friend co-writes) without either that or resorting to Cloudflare or the like for protection, because of the domination of these creepy-crawlies.<p>Edit: Google already downranked them, not threatened to downrank. Also, traffic rose but did not skyrocket; the relative amount of bot traffic skyrocketed. (Presumably, without the downranking, traffic would actually have skyrocketed.)
Feed readers should be sending the If-Modified-Since header, and web sites should properly recognize it and send a 304 Not Modified response. This isn’t new tech.
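To make the point concrete, here is a minimal sketch of a polite conditional fetch on the reader side (Python with the requests library; the in-memory cache is just a placeholder for whatever state a real reader keeps):<p><pre><code># Sketch of a conditional feed fetch: remember ETag/Last-Modified per feed,
# send them back on the next poll, and reuse the cached body on a 304.
import requests

_cache = {}  # url -> (etag, last_modified, body); a real reader would persist this

def fetch_feed(url):
    etag, last_modified, body = _cache.get(url, (None, None, None))
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        # Nothing changed since the last poll; reuse the cached body.
        return body
    resp.raise_for_status()
    _cache[url] = (resp.headers.get("ETag"),
                   resp.headers.get("Last-Modified"),
                   resp.text)
    return resp.text
</code></pre>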
The HTTP protocol is a lost art. These days people don't even look at the status code and instead expect some mumbo-jumbo JSON payload explaining the error.
I like Rachel's writing, but I don't understand this recent crusade against RSS readers. Sure, they should work properly and optimizations can be made to reduce bandwidth and processing power.<p>But... why not throw a CDN in front of your site and focus your energy somewhere else? I guess every problem has to be solved by someone, but this just seems like a very strange hill to die on.
This is why RSS is for the birds.<p>My RSS reader YOShInOn subscribes to 110 RSS feeds through Superfeedr, which absolves me of the responsibility of being on the other side of Rachel's problem.<p>With RSS you are always polling too fast or too slow; if you are polling too slow you might even miss items.<p>When a blog post gets published, Superfeedr hits an AWS Lambda function that stores the entry in SQS so my RSS reader can update itself at its own pace. The only trouble is that Superfeedr costs 10 cents a feed per month, which is a good deal for an active feed such as comments from Hacker News or articles from <i>The Guardian</i>, but is not affordable for subscribing to 2000+ indie blogs, which YOShInOn could handle just fine.<p>I might yet write my own RSS head end, but there is something to be said for protocols like ActivityPub and AT Protocol.
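For the curious, the push side described above could be sketched roughly like this (assuming Superfeedr is configured to POST JSON notifications to a Lambda behind API Gateway; the queue URL and payload field names are placeholders, not the actual YOShInOn code):<p><pre><code># Rough sketch of the push pipeline: Superfeedr POSTs new entries to a Lambda,
# which drops one SQS message per entry for the reader to drain at its own pace.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/feed-items"  # placeholder

def handler(event, context):
    # Assumes an API Gateway proxy integration, so the notification is in event["body"].
    payload = json.loads(event["body"])
    for item in payload.get("items", []):   # field names are illustrative
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({
                "title": item.get("title"),
                "url": item.get("permalinkUrl"),
                "published": item.get("published"),
            }),
        )
    return {"statusCode": 200, "body": "ok"}
</code></pre>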
RSS is pretty light. Even if you say it's too much to be re-sending, you could remove the content from the RSS feed (so readers need to click through to read it), which would shrink the feed size massively. Alternatively, remove old posts. Or do both.<p>Hopefully you don't have some expensive code generating the feed on the fly, so processing overhead is negligible. But if you do, cache the result and reset the cache every time you post.<p>Surely this is easier than spending the effort and emotional bandwidth to care about this issue?<p>I might be wrong here, but this feels more emotionally driven ("someone is wrong on the internet") than practical.
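A minimal sketch of the cache-and-reset idea, assuming a Flask app; render_feed_xml and on_publish_post are hypothetical names standing in for whatever the real site does:<p><pre><code># Cache the generated feed in memory and invalidate it whenever a post is published.
from flask import Flask, Response

app = Flask(__name__)
_cached_feed = None  # regenerated lazily, cleared on publish

def render_feed_xml():
    ...  # the expensive feed generation would live here
    return ""

@app.route("/feed.xml")
def feed():
    global _cached_feed
    if _cached_feed is None:
        _cached_feed = render_feed_xml()
    return Response(_cached_feed, mimetype="application/rss+xml")

def on_publish_post():
    # Call this from the publishing path to reset the cache.
    global _cached_feed
    _cached_feed = None
</code></pre>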
I am stupid, but why not just return an HTML document explaining the issue when there is such an incorrect second request within 20 minutes, and then block that IP for 24 hours? The feed reader's author has to react, otherwise their users will complain to them, no?
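A crude sketch of that policy (not anyone's production setup; the 20-minute and 24-hour thresholds and the in-memory dicts are simplistic placeholders):<p><pre><code># If the same IP re-fetches the full feed (no conditional headers) within
# 20 minutes, serve an explanatory page and ban it for 24 hours.
import time

last_full_fetch = {}   # ip -> timestamp of last unconditional fetch
banned_until = {}      # ip -> timestamp when the ban expires

def check_request(ip, has_conditional_headers, now=None):
    now = now or time.time()
    if banned_until.get(ip, 0) > now:
        return "banned"                        # e.g. 403 plus the HTML explanation
    if not has_conditional_headers:
        previous = last_full_fetch.get(ip, 0)
        if now - previous < 20 * 60:           # second full fetch within 20 minutes
            banned_until[ip] = now + 24 * 3600
            return "ban-and-explain"
        last_full_fetch[ip] = now
    return "ok"
</code></pre>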
On the flip side, what percent of RSS feed <i>generators</i> actually support conditional requests? I've written many over the last twenty years and I can tell you plainly, none of the ones I wrote have.<p>I never even considered the option or necessity. It's easy and cheap just to send everything.<p>I guess static generators behind an Apache-style web server probably do, but I can't imagine any dynamic generators bother to try to save the small handful of bytes.
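For anyone writing a dynamic generator, conditional-request support could look roughly like this (a sketch, not a reference implementation; the hash-based ETag and the function names are my own choices):<p><pre><code># Compute an ETag from the rendered feed and honor If-None-Match /
# If-Modified-Since before sending the full body.
import hashlib
from email.utils import format_datetime, parsedate_to_datetime

def respond_with_feed(feed_xml, last_modified_dt, request_headers):
    # last_modified_dt is assumed to be a timezone-aware UTC datetime.
    etag = '"%s"' % hashlib.sha256(feed_xml.encode("utf-8")).hexdigest()
    headers = {
        "ETag": etag,
        "Last-Modified": format_datetime(last_modified_dt, usegmt=True),
        "Content-Type": "application/rss+xml",
    }

    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""
    ims = request_headers.get("If-Modified-Since")
    if ims:
        try:
            if parsedate_to_datetime(ims) >= last_modified_dt:
                return 304, headers, b""
        except (TypeError, ValueError):
            pass  # malformed or naive date: fall through and send the full body
    return 200, headers, feed_xml.encode("utf-8")
</code></pre>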
I have a blog where I post a few posts per year. [1] /feed.xml is served with an Expires header of 24 hours. I wrote a tool that allows me to query the webserver logs using SQLite [2]. Over the past 90 days, these are the top 10 requesters grouped by ip address (remote_addr column redacted here):<p><pre><code> requests_per_day user_agent
283 Reeder/5050001 CFNetwork/1568.300.101 Darwin/24.2.0
274 CommaFeed/4.4.0 (https://github.com/Athou/commafeed)
127 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
52 NetNewsWire (RSS Reader; https://netnewswire.com/)
47 Tiny Tiny RSS/23.04-0578bf80 (https://tt-rss.org/)
47 Refeed Reader/v1 (+https://www.refeed.dev/)
46 Selfoss/2.18 (SimplePie/1.5.1; +https://selfoss.aditu.de)
41 Reeder/5040601 CFNetwork/1568.100.1.1.1 Darwin/24.0.0
39 Tiny Tiny RSS/23.04 (Unsupported) (https://tt-rss.org/)
34 FreshRSS/1.24.3 (Linux; https://freshrss.org)
</code></pre>
Reeder is loading the feed every 5 minutes, and in the vast majority of cases it’s getting a 301 response because it tries to access the http version that redirects to https. At least it has state and it gets 304 Not Modified in the remaining cases.<p>If I order by body bytes served rather than number of requests (and group by remote_addr again), these are the worst consumers:<p><pre><code> body_megabytes_per_year user_agent
149.75943975 Refeed Reader/v1 (+https://www.refeed.dev/)
95.90771025 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
75.00080025 rss-parser
73.023702 Tiny Tiny RSS/24.09-0163884ef (Unsupported) (https://tt-rss.org/)
38.402385 Tiny Tiny RSS/24.11-42ebdb02 (https://tt-rss.org/)
37.984539 Selfoss/2.20-cf74581 (+https://selfoss.aditu.de)
30.3982965 NetNewsWire (RSS Reader; https://netnewswire.com/)
28.18013325 Tiny Tiny RSS/23.04-0578bf80 (https://tt-rss.org/)
26.330142 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36
24.838461 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36
</code></pre>
The top consumer, Refeed, is responsible for about 2.25% of all egress of my webserver. (Counting only body bytes, not http overhead.)<p>[1]: <a href="https://ruudvanasseldonk.com/writing" rel="nofollow">https://ruudvanasseldonk.com/writing</a>
[2]: <a href="https://github.com/ruuda/sqlog/blob/d129db35da9bbf95d8c2e97d575b4d5beb3bb40c/queries/feed_readers.sql">https://github.com/ruuda/sqlog/blob/d129db35da9bbf95d8c2e97d...</a>
I ban the feed for 24 hours if it doesn't work.<p>I also designed two new formats that no one (including myself) has ever implemented.<p><a href="https://go-here.nl/ess-and-nno" rel="nofollow">https://go-here.nl/ess-and-nno</a><p>enjoy
I have a few feeds configured in Thunderbird but wasn’t reading them very often, so I “disabled” them, to be loaded manually. Despite this, it tries to contact the sites often and, when it can’t (firewall), goes into a frenzy of trying to reach them. All this despite the feeds being disabled.<p>Disappointing, combined with the various update sites it tries to contact on every startup, which is completely unnecessary as well. A couple of times a week should be the maximum rate.