A friend of mine co-runs a semi-popular, semi-niche news site (for more than a decade now), and complains that traffic recently rose due to bots masquerading as humans.<p>How would they know? Well, because Google, in its omniscience, started to downrank them for faking views with bots (which they do not do): it shows the bot percentage in traffic stats, and it skyrocketed relative to non-bot traffic (which is now less than 50%) as they started to fall from the front page (feeding the vicious circle). Presumably, Google does not know or care that it is a bot when it serves ads, but correlates it later with the metrics it has from other sites that use GA or ads.<p>Or, perhaps, Google spots the same anomalies that my friend (an old-school sysadmin who pays attention to logs) did, such as the increase in traffic along with a never-before-seen popularity among iPhone users (who are so tech-savvy that they apparently do not require CSS), or users from Dallas who famously love their QQBrowser. I’m not going to list all the telltale signs, as the crowd here is too hyped on LLMs (which is our going theory so far; it is very timely), but my friend hopes Google learns them quickly.<p>These newcomers usually fake the UA, use inconspicuous Western IPs (requests from Baidu/Tencent data center ranges do identify themselves as bots in the UA), ignore robots.txt, and load many pages very quickly.<p>I would assume the bot traffic increase applies to feeds as well, since they are of as much use for LLM training purposes.<p>My friend does not actually engage in stringent filtering like Rachel does, but I wonder how soon it becomes actually infeasible to operate a website <i>with actual original content</i> (which my friend co-writes) without either that or resorting to Cloudflare or the like for protection, because of the domination of these creepy-crawlies.<p>Edit: Google already downranked them, not threatened to downrank. Also, traffic rose but did not skyrocket; the relative amount of bot traffic skyrocketed. (Presumably, without the downranking, traffic would actually have skyrocketed.)
Feed readers should be sending the If-Modified-Since header, and web sites should properly recognize it and send a 304 Not Modified response. This isn’t new tech.
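To make the point concrete, here is a minimal sketch of a polite conditional fetch on the reader side (Python with the requests library; the in-memory cache is just a placeholder for whatever state a real reader keeps):<p><pre><code># Sketch of a conditional feed fetch: remember ETag/Last-Modified per feed,
# send them back on the next poll, and reuse the cached body on a 304.
import requests

_cache = {}  # url -> (etag, last_modified, body); a real reader would persist this

def fetch_feed(url):
    etag, last_modified, body = _cache.get(url, (None, None, None))
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        # Nothing changed since the last poll; reuse the cached body.
        return body
    resp.raise_for_status()
    _cache[url] = (resp.headers.get("ETag"),
                   resp.headers.get("Last-Modified"),
                   resp.text)
    return resp.text
</code></pre>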
The HTTP protocol is a lost art. These days people don't even look at the status code and instead expect some mumbo-jumbo JSON payload explaining the error.
I like Rachel's writing, but I don't understand this recent crusade against RSS readers. Sure, they should work properly and optimizations can be made to reduce bandwidth and processing power.<p>But... why not throw a CDN in front of your site and focus your energy somewhere else? I guess every problem has to be solved by someone, but this just seems like a very strange hill to die on.
This is why RSS is for the birds.<p>My RSS reader YOShInOn subscribes to 110 RSS feeds through Superfeedr, which absolves me of the responsibility of being on the other side of Rachel's problem.<p>With RSS you are always polling too fast or too slow; if you are polling too slow you might even miss items.<p>When a blog post gets published, Superfeedr hits an AWS Lambda function that stores the entry in SQS so my RSS reader can update itself at its own pace. The only trouble is that Superfeedr costs 10 cents a feed per month, which is a good deal for an active feed such as comments from Hacker News or articles from <i>The Guardian</i>, but is not affordable for subscribing to 2000+ indie blogs, which YOShInOn could handle just fine.<p>I might yet write my own RSS head end, but there is something to be said for protocols like ActivityPub and AT Protocol.
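For the curious, the push side described above could be sketched roughly like this (assuming Superfeedr is configured to POST JSON notifications to a Lambda behind API Gateway; the queue URL and payload field names are placeholders, not the actual YOShInOn code):<p><pre><code># Rough sketch of the push pipeline: Superfeedr POSTs new entries to a Lambda,
# which drops one SQS message per entry for the reader to drain at its own pace.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/feed-items"  # placeholder

def handler(event, context):
    # Assumes an API Gateway proxy integration, so the notification is in event["body"].
    payload = json.loads(event["body"])
    for item in payload.get("items", []):   # field names are illustrative
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({
                "title": item.get("title"),
                "url": item.get("permalinkUrl"),
                "published": item.get("published"),
            }),
        )
    return {"statusCode": 200, "body": "ok"}
</code></pre>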
RSS is pretty light. Even if you say it's too much to be re-sending, you could remove the content from the RSS feed (so readers need to click through to read it), which would shrink the feed size massively. Alternatively, remove old posts. Or do both.<p>Hopefully you don't have some expensive code generating the feed on the fly, so processing overhead is negligible. But if you do, cache the result and reset the cache every time you post.<p>Surely this is easier than spending the effort and emotional bandwidth to care about this issue?<p>I might be wrong here, but this feels more emotionally driven ("someone is wrong on the internet") than practical.
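A minimal sketch of the cache-and-reset idea, assuming a Flask app; render_feed_xml and on_publish_post are hypothetical names standing in for whatever the real site does:<p><pre><code># Cache the generated feed in memory and invalidate it whenever a post is published.
from flask import Flask, Response

app = Flask(__name__)
_cached_feed = None  # regenerated lazily, cleared on publish

def render_feed_xml():
    ...  # the expensive feed generation would live here
    return ""

@app.route("/feed.xml")
def feed():
    global _cached_feed
    if _cached_feed is None:
        _cached_feed = render_feed_xml()
    return Response(_cached_feed, mimetype="application/rss+xml")

def on_publish_post():
    # Call this from the publishing path to reset the cache.
    global _cached_feed
    _cached_feed = None
</code></pre>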
I am stupid, but why not just return an HTML document explaining the issue when there is such an incorrect second request within 20 minutes, and then block that IP for 24 hours? The feed reader's author has to react, otherwise their users will complain to them, no?
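A crude sketch of that policy (not anyone's production setup; the 20-minute and 24-hour thresholds and the in-memory dicts are simplistic placeholders):<p><pre><code># If the same IP re-fetches the full feed (no conditional headers) within
# 20 minutes, serve an explanatory page and ban it for 24 hours.
import time

last_full_fetch = {}   # ip -> timestamp of last unconditional fetch
banned_until = {}      # ip -> timestamp when the ban expires

def check_request(ip, has_conditional_headers, now=None):
    now = now or time.time()
    if banned_until.get(ip, 0) > now:
        return "banned"                        # e.g. 403 plus the HTML explanation
    if not has_conditional_headers:
        previous = last_full_fetch.get(ip, 0)
        if now - previous < 20 * 60:           # second full fetch within 20 minutes
            banned_until[ip] = now + 24 * 3600
            return "ban-and-explain"
        last_full_fetch[ip] = now
    return "ok"
</code></pre>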
On the flip side, what percent of RSS feed <i>generators</i> actually support conditional requests? I've written many over the last twenty years and I can tell you plainly, none of the ones I wrote have.<p>I never even considered the option or necessity. It's easy and cheap just to send everything.<p>I guess static generators behind an Apache-style web server probably do, but I can't imagine any dynamic generators bother to try to save the small handful of bytes.
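For anyone writing a dynamic generator, conditional-request support could look roughly like this (a sketch, not a reference implementation; the hash-based ETag and the function names are my own choices):<p><pre><code># Compute an ETag from the rendered feed and honor If-None-Match /
# If-Modified-Since before sending the full body.
import hashlib
from email.utils import format_datetime, parsedate_to_datetime

def respond_with_feed(feed_xml, last_modified_dt, request_headers):
    # last_modified_dt is assumed to be a timezone-aware UTC datetime.
    etag = '"%s"' % hashlib.sha256(feed_xml.encode("utf-8")).hexdigest()
    headers = {
        "ETag": etag,
        "Last-Modified": format_datetime(last_modified_dt, usegmt=True),
        "Content-Type": "application/rss+xml",
    }

    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""
    ims = request_headers.get("If-Modified-Since")
    if ims:
        try:
            if parsedate_to_datetime(ims) >= last_modified_dt:
                return 304, headers, b""
        except (TypeError, ValueError):
            pass  # malformed or naive date: fall through and send the full body
    return 200, headers, feed_xml.encode("utf-8")
</code></pre>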
I have a blog where I post a few posts per year. [1] /feed.xml is served with an Expires header of 24 hours. I wrote a tool that allows me to query the webserver logs using SQLite [2]. Over the past 90 days, these are the top 10 requesters grouped by ip address (remote_addr column redacted here):<p><pre><code> requests_per_day user_agent
283 Reeder/5050001 CFNetwork/1568.300.101 Darwin/24.2.0
274 CommaFeed/4.4.0 (https://github.com/Athou/commafeed)
127 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
52 NetNewsWire (RSS Reader; https://netnewswire.com/)
47 Tiny Tiny RSS/23.04-0578bf80 (https://tt-rss.org/)
47 Refeed Reader/v1 (+https://www.refeed.dev/)
46 Selfoss/2.18 (SimplePie/1.5.1; +https://selfoss.aditu.de)
41 Reeder/5040601 CFNetwork/1568.100.1.1.1 Darwin/24.0.0
39 Tiny Tiny RSS/23.04 (Unsupported) (https://tt-rss.org/)
34 FreshRSS/1.24.3 (Linux; https://freshrss.org)
</code></pre>
Reeder is loading the feed every 5 minutes, and in the vast majority of cases it’s getting a 301 response because it tries to access the http version that redirects to https. At least it has state and it gets 304 Not Modified in the remaining cases.<p>If I order by body bytes served rather than number of requests (and group by remote_addr again), these are the worst consumers:<p><pre><code> body_megabytes_per_year user_agent
149.75943975 Refeed Reader/v1 (+https://www.refeed.dev/)
95.90771025 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
75.00080025 rss-parser
73.023702 Tiny Tiny RSS/24.09-0163884ef (Unsupported) (https://tt-rss.org/)
38.402385 Tiny Tiny RSS/24.11-42ebdb02 (https://tt-rss.org/)
37.984539 Selfoss/2.20-cf74581 (+https://selfoss.aditu.de)
30.3982965 NetNewsWire (RSS Reader; https://netnewswire.com/)
28.18013325 Tiny Tiny RSS/23.04-0578bf80 (https://tt-rss.org/)
26.330142 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36
24.838461 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36
</code></pre>
The top consumer, Refeed, is responsible for about 2.25% of all egress of my webserver. (Counting only body bytes, not http overhead.)<p>[1]: <a href="https://ruudvanasseldonk.com/writing" rel="nofollow">https://ruudvanasseldonk.com/writing</a>
[2]: <a href="https://github.com/ruuda/sqlog/blob/d129db35da9bbf95d8c2e97d575b4d5beb3bb40c/queries/feed_readers.sql">https://github.com/ruuda/sqlog/blob/d129db35da9bbf95d8c2e97d...</a>
I ban the feed for 24 hours if it doesn't work.<p>I also designed two new formats that no one (including myself) has ever implemented.<p><a href="https://go-here.nl/ess-and-nno" rel="nofollow">https://go-here.nl/ess-and-nno</a><p>enjoy
I have a few feeds configured in Thunderbird but wasn’t reading them very often, so I “disabled” them, to be loaded manually. Despite this, it tries to contact the sites often and, when it can’t (firewall), goes into a frenzy of trying to reach them. All this despite the feeds being disabled.<p>Disappointing, combined with the various update sites it tries to contact on every startup, which is completely unnecessary as well. A couple of times a week should be the maximum rate.