I run NewsBlur[0] and I've been battling this issue of NewsBlur's feed fetchers receiving 403s across the web for months now. My users are revolting and asking for refunds. I've tried emailing dozens of site owners and publishers, and only two of them have done the work of whitelisting their RSS feed. It's maddening and is having a real negative effect on NewsBlur.<p>NewsBlur is an open-source RSS news reader (full source available at [1]), something we should all agree is necessary to support the open web! But Cloudflare blocking all of my feed fetchers is bizarre behavior. We've been on the verified bots list for years, but it hasn't made a difference.<p>Let me know what I can do. NewsBlur publishes a list of the IP addresses it uses for feed fetching, and I've shared it with Cloudflare, but that hasn't made a difference either.<p>I'm hoping Cloudflare will take that published IP list and add it to their allowlist so NewsBlur can keep fetching (and archiving) millions of feeds.<p>[0]: <a href="https://newsblur.com" rel="nofollow">https://newsblur.com</a><p>[1]: <a href="https://github.com/samuelclay/NewsBlur">https://github.com/samuelclay/NewsBlur</a>
I dislike the advice of whitelisting specific readers by user agent. Not only is it endless manual work that only solves the problem for a subset of readers, it is also easy for malicious actors to bypass. My recommendation would be to create a page rule that disables bot blocking for your feeds. This fixes the problem for all readers with no ongoing maintenance.<p>If you are worried about DoS attacks hammering your feeds, you can use the same configuration rule to ignore the query string in the cache key (if your feed doesn't use query strings) and to override the caching settings if your server doesn't set the proper headers. That way Cloudflare will cache your feed and you can serve any number of visitors without putting load on your origin.<p>As for Cloudflare fixing the defaults, that seems unlikely to happen. It has been broken for years, and Cloudflare's own blog is affected. They have been "actively working" on fixing it for at least 2 years according to their VP of product: <a href="https://news.ycombinator.com/item?id=33675847">https://news.ycombinator.com/item?id=33675847</a>
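For anyone who wants to script this instead of clicking through the dashboard, here is a rough sketch of adding such a skip rule through Cloudflare's rulesets API. The zone ID and token are placeholders, and the phase name, endpoint shape, and product names are my reading of the API docs rather than anything authoritative, so verify them before relying on this.<p><pre><code>// Rough sketch (TypeScript, Node 18+): add a "skip" rule for feed URLs
// via Cloudflare's rulesets API. Phase, endpoint, and product names are
// assumptions from the docs -- verify before use.
const ZONE = "your-zone-id"; // placeholder
const HEADERS = {
  Authorization: `Bearer ${process.env.CF_API_TOKEN}`,
  "Content-Type": "application/json",
};
const BASE = `https://api.cloudflare.com/client/v4/zones/${ZONE}/rulesets`;

// 1. Look up the entrypoint ruleset for the custom-firewall phase.
const entry = await (
  await fetch(`${BASE}/phases/http_request_firewall_custom/entrypoint`, {
    headers: HEADERS,
  })
).json();

// 2. Append a rule that skips bot-mitigation products on feed paths.
const res = await fetch(`${BASE}/${entry.result.id}/rules`, {
  method: "POST",
  headers: HEADERS,
  body: JSON.stringify({
    action: "skip",
    description: "Don't challenge feed readers",
    expression:
      '(http.request.uri.path eq "/feed") or (http.request.uri.path eq "/rss.xml")',
    action_parameters: { products: ["bic", "securityLevel", "uaBlock"] },
  }),
});
console.log(await res.json());
</code></pre>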
At Listen Notes, we rely heavily on Cloudflare to manage and protect our services, which cater to both human users and scripts/bots.<p>One particularly effective strategy we've implemented is using separate subdomains for services designed for different types of traffic, allowing us to apply customized firewall and page rules to each subdomain.<p>For example:<p>- www.listennotes.com is dedicated to human users. E.g., <a href="https://www.listennotes.com/podcast-realtime/" rel="nofollow">https://www.listennotes.com/podcast-realtime/</a><p>- feeds.listennotes.com is tailored for bots, providing access to RSS feeds. E.g., <a href="https://feeds.listennotes.com/listen/wenbin-fangs-podcast-playlist-kr3-ta28cJu/rss/" rel="nofollow">https://feeds.listennotes.com/listen/wenbin-fangs-podcast-pl...</a><p>- audio.listennotes.com serves both humans and bots, handling audio URL proxies. E.g., <a href="https://audio.listennotes.com/e/p/1a0b2d081cae4d6d9889c496513646e2/" rel="nofollow">https://audio.listennotes.com/e/p/1a0b2d081cae4d6d9889c49651...</a><p>This subdomain-based approach enables us to fine-tune security and performance settings for each type of traffic, ensuring optimal service delivery.
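For anyone replicating this, the mechanism is essentially a configuration rule scoped by hostname. A minimal sketch, assuming the parameter names from Cloudflare's rulesets docs (they're my assumption, not copied from our actual setup):<p><pre><code>// Sketch of a configuration rule that relaxes security only on the
// bot-facing subdomain. Parameter names ("security_level", "bic") are
// assumptions from the rulesets docs; the hostname is an example.
const relaxFeeds = {
  action: "set_config",
  description: "Relax security for the feeds subdomain",
  expression: 'http.host eq "feeds.example.com"',
  action_parameters: {
    security_level: "essentially_off", // lowest security preset
    bic: false, // turn off Browser Integrity Check for this host
  },
};
// POST this to the http_config_settings phase entrypoint, in the same
// way as the WAF rule sketch elsewhere in this thread.
</code></pre>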
I get blocked from websites with some regularity, running Firefox with strict privacy settings ("resist fingerprinting", etc.) on OpenBSD. They just give a 403 Forbidden with no explanation, and it's only ever on sites fronted by Cloudflare. Good times. Seems legit.
My email is jgc@cloudflare.com. I'd like to hear directly from the owners of RSS readers on what they are experiencing. Going to ask the team to take a closer look.
As the owner of an RSS reader, I love that they are making this more public. 30% of our support requests are "my feed doesn't work". It sucks that the only thing we can say is "contact the site owner, it's their firewall". And to be fair, it's not only Cloudflare; many different firewall setups cause issues. It's ironic that a public API endpoint meant for bots is blocked for being a bot.
I maintain an RSS reader for work and Cloudflare is the bane of my existence. Tons of feeds will stop working at random and there’s nothing we can do about it except for individually contacting website owners and asking them to add an exception for their feed URL.
Using Cloudflare on your website could be blocking Safari users, Chrome users, or just any users. It's totally broken. They have no way of measuring the false positives. Website owners are paying for it in lost revenue, and poor users lose access through no fault of their own. That will last until some C-level exec at a BigTech randomly gets blocked and makes noise, and even then Cloudflare will probably just whitelist that specific domain/IP. It is very interesting that I have never been blocked when trying to access Cloudflare itself, only on their customers' sites.
Cloudflare has been the bane of my web existence on a Thai IP with a Linux Firefox fingerprint. I wonder how much traffic is lost because of Cloudflare; of course, none of that is reported to the web admins, so everyone continues in their jolly ignorance.<p>I wrote my own RSS bridge that scrapes websites using the Scrapfly web scraping API, which bypasses all of that, because it's so annoying that I can't even scrape some company's /blog that they are literally buying ads for but that somehow has an anti-bot enabled that blocks all RSS readers.<p>The modern web is so antisocial that the web 2.0 guys should be rolling in their "everything will be connected with APIs" graves by now.
My company runs a tech news website. We offer an RSS feed, as any Drupal website would, and content farms scrape it to rehost our content in full. This is usually fine for us; the content is CC-licensed and they do credit the correct source. But they run thousands of different WordPress instances on the same IP, and each one fetches the feed individually.<p>In the end we had to use Cloudflare to rate limit the RSS endpoint.
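In case it helps anyone, the shape of such a rate-limiting rule under Cloudflare's rulesets API is roughly the sketch below. The field names are my best reading of the docs, and the thresholds are illustrative, not our production values:<p><pre><code>// Illustrative shape of a Cloudflare rate-limiting rule for a feed
// endpoint (http_ratelimit phase). Field names are my best reading of
// the rulesets docs; the thresholds are examples, not our real values.
const throttleFeed = {
  action: "block",
  description: "Throttle aggressive RSS scrapers per source IP",
  expression: 'http.request.uri.path eq "/rss.xml"',
  ratelimit: {
    characteristics: ["cf.colo.id", "ip.src"], // count per IP, per colo
    period: 60, // counting window in seconds
    requests_per_period: 30, // ~1 fetch every 2s per IP is plenty for RSS
    mitigation_timeout: 600, // block offenders for 10 minutes
  },
};
</code></pre>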
Not "could" but it is actually blocking. Very annoying when government website does that, as usually it is next to impossible to explain the issue and ask for a fix. And even if the fix is made, it is reverted several weeks later. Other websites does that too, it was funny when one website was asking RSS reader to resolve captcha and prove they are human.
In any case, it blocks German Telekom users. There is an ongoing dispute between Cloudflare and Telekom as to who pays for the traffic costs. Telekom is therefore throttling connections to Cloudflare. This is the reason why we can no longer use Cloudflare.
My employer, Read the Docs, is a heavy user of Cloudflare. It's actually hard to imagine serving as much traffic as we do, as cheaply as we do, without them.<p>That said, for publicly hosted open source documentation, we turn down the security settings almost all the way. The security level is set to "essentially off" (that's the actual setting name), no browser integrity check, TOR friendly (onion routing on), etc. We still have rate limits in place, but they're pretty generous (~4 req/s sustained). For sites that don't require a login and don't accept inbound leads or something like that, that's probably around the right level. Our domains where doc authors manage their docs have higher security settings.<p>Still, being too generous can get you into trouble, so I understand why people crank up the settings and just block some legitimate traffic. See our past post where AI scrapers downloaded almost 100TB (<a href="https://news.ycombinator.com/item?id=41072549">https://news.ycombinator.com/item?id=41072549</a>).
This is an active issue with Rate Your Music right now: <a href="https://rateyourmusic.com/rymzilla/view?id=6108" rel="nofollow">https://rateyourmusic.com/rymzilla/view?id=6108</a><p>Unfixed for 4 months.
"could be blocking RSS users" it says it all "could". I use RSS on my websites, which are serviced by Cloudflare, and my users are not blocked. For that, fine-tuning and setting Configuration Rules at Cloudflare Dashboard are required. Anyone on a free has access to 10 Configuration Rules. I prefer using Cloudflare Workers to tune better, but there is a cost. My suggestion for RSS these days is to reduce the info on RSS feed to teasers, AI bots are using RSS to circumvent bans, and continue to scrape.
I'm happy to see that a post regarding the use of RSS gets so much attention on HN. It's a good sign. As I have basically lived in my feed reader since 2007 or so, one of my greatest fears is the slow demise of RSS through reduced support for RSS feeds by website owners.
Can you whitelist URLs to be read by bots on Cloudflare? Maybe that's a good solution: as a site maintainer you could whitelist your RSS feeds, sitemaps, and other content meant for bots.<p>Also, Cloudflare could ship a dedicated section in the admin panel for whitelisting RSS feeds and sitemaps, making it easier for users to discover (and educating them) that they may not want to block those bots, which aren't a threat to a site. Of course, rules to prevent DDoS on those URLs would still apply, such as against massive request volumes or other behavior that common RSS readers don't exhibit.
I see this on a regular basis. My self-hosted RSS reader is blocked by Cloudflare even after my IP address was explicitly allowlisted by a few feed owners.
As an admin of my personal website, I completely disable all Cloudflare features and use it only for DNS and domain registration. I also stop following websites that use Cloudflare checks or cookie popups (cookies are fine, but the popups are annoying).
This is a truly problematic issue that I've experienced as well. The best solution is probably for Cloudflare to figure out what normal RSS usage looks like and have a provision for that in their bot detection.
IIRC, even if you're listed as a "good bot" with Cloudflare, high security settings chosen by the CF user can still result in 403s.<p>No idea if CF already does this, but allowing users to generate access tokens for 3rd-party services would be another way of easing access, alongside their apparent URL and IP whitelisting.
I believe this also poses issues for people running ad blockers. I get tons of repetitive captchas on some websites.<p>Other companies offering similar services, like Imperva, seem to outright ban my IP after one visit to a website with uBlock Origin: I first get a captcha, then a page saying I am not allowed, and whatever I do, even using an extensionless Chrome browser with a new profile, I can't visit the site anymore because my IP is banned.
I'd have thought the website owner whitelisting their RSS feed URI (or pattern matching *.xml/*.rss) would be better than doing it based on the user-agent string. For one, you'd expect bot traffic on these endpoints, and you're also not leaving a door open to anyone who fakes their user agent.<p>Looks like it should be possible under the WAF.
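Something like the following as the rule expression, paired with a "skip" action, would cover the common feed paths. This is Cloudflare's rules language; I believe ends_with() exists in it, though I'm not sure which plans have it, so "contains" may be the fallback:<p><pre><code>ends_with(http.request.uri.path, ".xml")
or ends_with(http.request.uri.path, ".rss")
or http.request.uri.path contains "/feed"
</code></pre>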
I was bitten by this as well. My product retrieves RSS feeds from public government sites, and suddenly I'm blocked by Cloudflare's anti-bot measures for trying to access a page that was specifically created for machine consumption. It is not that the website owner or publisher intends to block this. They are unaware that turning on Cloudflare will block everything, even stuff that is allowed to be consumed according to robots.txt.<p>P.S. When I mentioned this here on HN a few weeks back, it was implied that I probably did not respect robots.txt (I do; Cloudflare does not) or that I should get in touch with the site administrators (impossible to do in any reasonably effective way at scale).
I noticed this a while back when I was trying to read Cloudflare's own blog. Periodically they would block my newsreader, and I ended up just dropping their feed.<p>I am glad to see other people calling out the problem. Hopefully, a solution will emerge.
I've noticed that Old Reddit still supports RSS feeds without returning a 403 error, in contrast to the main site, which often blocks RSS requests.<p>Here are some DNS details:<p>- The main Reddit site (www.reddit.com) uses Fastly.<p>- Old Reddit (old.reddit.com) also uses Fastly.<p>- However, the "vomit" address (which often returns 403s for RSS requests) uses AWS DNS.<p>Is Old Reddit not behind Cloudflare, or is there another reason why it handles RSS requests differently?
Ah, the Cloudflare free plan does not automatically turn these on. I know since I use it for some small things and don't have these enabled. I wouldn't use User-Agent filtering because user agents are spoofable, but putting feeds on a separate URL is probably a good idea. Right now the feed is actually generated on request for these sites, so caching it is probably a good idea anyway. I can do that rudimentarily by periodically generating the feed and copying it over, as sketched below.
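A minimal sketch of that, assuming a local dynamic feed endpoint and a static docroot (both placeholders), run from cron every few minutes:<p><pre><code>// Pre-generate the feed on a schedule so the CDN can cache a static
// file (TypeScript, Node 18+). URL and output path are placeholders;
// schedule it with cron, e.g. every 10 minutes.
import { writeFile } from "node:fs/promises";

const res = await fetch("http://localhost:8080/feed.xml"); // dynamic feed
if (!res.ok) throw new Error(`feed fetch failed: ${res.status}`);
await writeFile("/var/www/static/feed.xml", await res.text()); // static copy
</code></pre>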
Suggesting that website operators should allowlist RSS clients through the Cloudflare bot detection system via their user-agent is a rather concerning recommendation.
This is an issue with techdirt.com. I contacted them about this through their feedback form a long time ago, but unfortunately the issue remains.
It also manages to break IRC bots that do things like show the contents of the title tag when someone posts a link. Another cloudy annoyance, albeit a minor one.