Using Cloudflare on your website could be blocking RSS users

559 pointsby campuscodi7 months ago

44 comments

conesus7 months ago

I run NewsBlur[0] and I’ve been battling this issue of NewsBlur fetching 403s across the web for months now. My users are revolting and asking for refunds. I’ve tried emailing dozens of site owners and publishers and only two of them have done the work of whitelisting their RSS feed. It’s maddening and is having a real negative effect on NewsBlur.NewsBlur is an open-source RSS news reader (full source available at [1]), something we should all agree is necessary to support the open web! But Cloudflare blocking all of my feed fetchers is bizarre behavior. And we’re on the verified bots list for years, but it hasn’t made a difference.Let me know what I can do. NewsBlur publishes a list of IPs that it uses for feed fetching that I've shared with Cloudflare but it hasn't made a difference.I'm hoping Cloudflare uses the IP address list that I publish and adds them to their allowlist so NewsBlur can keep fetching (and archiving) millions of feeds.[0]: <a href="https://newsblur.com" rel="nofollow">https://newsblur.com</a>[1]: <a href="https://github.com/samuelclay/NewsBlur">https://github.com/samuelclay/NewsBlur</a>

评论 #41872099 未加载

评论 #41871518 未加载

评论 #41872239 未加载

评论 #41873211 未加载

评论 #41873748 未加载

评论 #41894757 未加载

评论 #41886124 未加载

评论 #41873147 未加载

kevincox7 months ago

I dislike advice of whitelisting specific readers by user-agent. Not only is this endless manual work that will only solve the problem for a subset of users but it also is easy to bypass by malicious actors. My recommendation would be to create a page rule that disables bot blocking for your feeds. This will fix the problem for all readers with no ongoing maintenance.If you are worried about DoS attacks that may hammer on your feeds then you can use the same configuration rule to ignore the query string for cache keys (if your feed doesn't use query strings) and overriding the caching settings if your server doesn't set the proper headers. This way Cloudflare will cache your feed and you can serve any number of visitors without putting load onto your origin.As for Cloudflare fixing the defaults, it seems unlikely to happen. It has been broken for years, Cloudflare's own blog is affected. They have been "actively working" on fixing it for at least 2 years according to their VP of product: <a href="https://news.ycombinator.com/item?id=33675847">https://news.ycombinator.com/item?id=33675847</a>

评论 #41869223 未加载

评论 #41867168 未加载

评论 #41868163 未加载

wenbin7 months ago

At Listen Notes, we rely heavily on Cloudflare to manage and protect our services, which cater to both human users and scripts/bots.One particularly effective strategy we've implemented is using separate subdomains for services designed for different types of traffic, allowing us to apply customized firewall and page rules to each subdomain.For example:- www. listennotes.com is dedicated to human users. E.g., <a href="https://www.listennotes.com/podcast-realtime/" rel="nofollow">https://www.listennotes.com/podcast-realtime/</a>- feeds. listennotes.com is tailored for bots, providing access to RSS feeds. Eg., <a href="https://feeds.listennotes.com/listen/wenbin-fangs-podcast-playlist-kr3-ta28cJu/rss/" rel="nofollow">https://feeds.listennotes.com/listen/wenbin-fangs-podcast-pl...</a>- audio. listennotes.com serves both humans and bots, handling audio URL proxies. E.g., <a href="https://audio.listennotes.com/e/p/1a0b2d081cae4d6d9889c496513646e2/" rel="nofollow">https://audio.listennotes.com/e/p/1a0b2d081cae4d6d9889c49651...</a>This subdomain-based approach enables us to fine-tune security and performance settings for each type of traffic, ensuring optimal service delivery.

评论 #41871246 未加载

amatecha7 months ago

I get blocked from websites with some regularity, running Firefox with strict privacy settings, "resist fingerprinting" etc. on OpenBSD. They just give a 403 Forbidden with no explanation, but it's only ever on sites fronted by CloudFlare. Good times. Seems legit.

评论 #41868030 未加载

评论 #41867245 未加载

评论 #41869190 未加载

评论 #42002463 未加载

评论 #41868594 未加载

评论 #41869439 未加载

评论 #41871086 未加载

评论 #41869685 未加载

评论 #41868383 未加载

评论 #41867658 未加载

评论 #41873407 未加载

评论 #41873926 未加载

评论 #41867420 未加载

评论 #41869823 未加载

jgrahamc7 months ago

My email is jgc@cloudflare.com. I'd like to hear from the owners of RSS readers directly on what they are experiencing. Going to ask team to take a closer look.

评论 #41867836 未加载

评论 #41869657 未加载

评论 #41867476 未加载

评论 #41869258 未加载

评论 #41868190 未加载

评论 #41868888 未加载

评论 #41876633 未加载

erikrothoff7 months ago

As the owner of an RSS reader I love that they are making this more public. 30% of our support requests are ”my feed doesn’t” work. It sucks that the only thing we can say is ”contact the site owner, it’s their firewall”. And to be fair it’s not only Cloudflare, so many different firewall setups cause issues. It’s ironic that a public API endpoint meant for bots is blocked for being a bot.

belkinpower7 months ago

I maintain an RSS reader for work and Cloudflare is the bane of my existence. Tons of feeds will stop working at random and there’s nothing we can do about it except for individually contacting website owners and asking them to add an exception for their feed URL.

评论 #41867066 未加载

评论 #41867055 未加载

评论 #41867945 未加载

elwebmaster7 months ago

Using Cloudflare on your website could be blocking Safari users, Chrome users, or just any users. It’s totally broken. They have no way of measuring the false positives. Website owners are paying for it in lost revenue. And poor users who lose access for no fault of their own. Until some C-level exec at a BigTech randomly gets blocked and makes noise. But even then, Cloudflare will probably just whitelist that specific domain/IP. It is very interesting how I have never been blocked when trying to access Cloudflare itself, only blocked on their customer’s sites.

wraptile7 months ago

Cloudflare has been the bane of my web existance on Thai IP and a Linux Firefox fingerprint. I wonder how much traffic is lost because of Cloudflare and of course none of that is reported to the web admins so everyone continues with their jolly ignorance.I wrote my own RSS bridge that scrapes websites using Scrapfly web scraping API that bypasses all that because it's so annoying that I can't even scrape some company's /blog that they are literally buying ads for but somehow have an anti-bot enabled that blocks all RSS readers.Modern web is so anti social that the web 2.0 guys should be rolling in their "everything will be connected with APIs" graves by now.

评论 #41870021 未加载

whs7 months ago

My company runs a tech news website. We offer RSS feed as any Drupal website would, which content farm just scrape our RSS feed to rehost our content in full. This is usually fine for us - the content is CC-licensed and they do post the correct source. But they run thousands of different WordPress instances on the same IP and they individually fetch the feed.In the end we had to use Cloudflare to rate limit the RSS endpoint.

评论 #41868962 未加载

评论 #41869978 未加载

butz7 months ago

Not "could" but it is actually blocking. Very annoying when government website does that, as usually it is next to impossible to explain the issue and ask for a fix. And even if the fix is made, it is reverted several weeks later. Other websites does that too, it was funny when one website was asking RSS reader to resolve captcha and prove they are human.

MarvinYork7 months ago

In any case, it blocks German Telekom users. There is an ongoing dispute between Cloudflare and Telekom as to who pays for the traffic costs. Telekom is therefore throttling connections to Cloudflare. This is the reason why we can no longer use Cloudflare.

评论 #41868800 未加载

评论 #41881675 未加载

davidfischer7 months ago

My employer, Read the Docs is a heavy user of Cloudflare. It's actually hard to imagine serving as much traffic as we do as cheaply as we can without them.That said, for publicly hosted open source documentation, we turn down the security settings almost all the way. Security level is set to "essentially off" (that's the actual setting name), no browser integrity check, TOR friendly (onion routing on), etc. We still have rate limits in place but they're pretty generous (~4 req/s sustained). For sites that don't require a login and don't accept inbound leads or something like that, that's probably around the right level. Our domains where doc authors manage their docs have higher security settings.That said, being too generous can get you into trouble so I understand why people crank up the settings and just block some legitimate traffic. See our past post where AI scrapers scraped almost 100TB (<a href="https://news.ycombinator.com/item?id=41072549">https://news.ycombinator.com/item?id=41072549</a>).

mbo7 months ago

This is an active issue with Rate Your Music right now: <a href="https://rateyourmusic.com/rymzilla/view?id=6108" rel="nofollow">https://rateyourmusic.com/rymzilla/view?id=6108</a>Unfixed for 4 months.

veeti7 months ago

I believe that disabling "Bot Fight Mode" is not enough, you may also need to create a rule to disable "Browser Integrity Check".

hugoromano7 months ago

"could be blocking RSS users" it says it all "could". I use RSS on my websites, which are serviced by Cloudflare, and my users are not blocked. For that, fine-tuning and setting Configuration Rules at Cloudflare Dashboard are required. Anyone on a free has access to 10 Configuration Rules. I prefer using Cloudflare Workers to tune better, but there is a cost. My suggestion for RSS these days is to reduce the info on RSS feed to teasers, AI bots are using RSS to circumvent bans, and continue to scrape.

imartin2k7 months ago

I’m happy to see that a post regarding the use of RSS gets so much attention on HN. It’s a good sign. As I basically live in my feed reader since 2007 or so, one of my greatest fears is the slow demise of RSS by way of reduced support of RSS feeds by websites owners.

pentagrama7 months ago

Can you whitelists urls to be readead by bot on Cloudflare? Maybe this is a good solution, and there you can put your RSS feeds, sitemaps, and other content for bots. Also Cloudflare can make a dedicated fields to whitelists RSS and Sitemaps on the admin panel so users can discover more easily that they may don't want block those bots.Can you whitelist URLs to be read by bots on Cloudflare? Maybe this is a good solution, where you as a site mantainer can include your RSS feeds, sitemaps, and other content for bots.Also, Cloudflare could ship a feature by creating a dedicated section in the admin panel to let the user add and whitelist RSS feeds and sitemaps, making it easier (and educate) users to avoid blocking those bots who aren't a threat to your site, of course sill considering rules to avoid DDOS on this urls, like massive requests or stuff that common bots from RSS readers don't do.

pointlessone7 months ago

I see this on a regular basis. My self-hosted RSS reader is blocked by Cloudflare even after my IP address was explicitly allowlisted by a few feed owners.

tandav7 months ago

As an admin of my personal website, I completely disable all Cloudflare features and use it only for DNS and domain registration. I also stop following websites that use Cloudflare checks or cookie popups (cookies are fine, but the popups are annoying).

artooro7 months ago

This is a truly problematic issue that I've experienced as well. The best solution is probably for Cloudflare to figure out what normal RSS usage looks like and have a provision for that in their bot detection.

ricardo817 months ago

iirc even if you're listed as a "good bot" with Cloudflare, high security settings by the CF user can still result in 403s.No idea if CF already does this, but allowing users to generate access tokens for 3rd party services would be another way of easing access alongside their apparent URL and IP whitelisting.

account427 months ago

Or just normal human users with a niche browser like Firefox.

prmoustache7 months ago

I believe this also pose issues to people running adblockers. I get tons of repetitive captchas on some websites.Also other companies offering similar services like imperva seems to be straight banning my ip after one visit to a website with uBlock Origin I first get a captcha, then a page saying I am not allowed, and whatever I do, even using an extensionless chrome browser with a new profile I can't visit it anymore because my ip is banned.

评论 #41868970 未加载

ectospheno7 months ago

I love that I get a cloudflare human check on almost every page they serve for customers except for when I login to my cloudflare account. Good times.

评论 #41871329 未加载

srmarm7 months ago

I'd have thought the website owner whitelisting their RSS feed URI (or pattern matching *.xml/*.rss) might be better than doing it based on the users agent string. For one you'd expect bot traffic on these end points and you're also not leaving a door open to anyone who fakes their user agent.Looks like it should be possible under the WAF

rcarmo7 months ago

Ironically, the site seems to currently be hugged to death, so maybe they should consider using Cloudflare to deal with HN traffic?

评论 #41867964 未加载

评论 #41867409 未加载

PeterStuer7 months ago

I was bitten by this as well. My product retrieves RSS feeds from public government sites, and suddely I'm blocked by cloudflair's antibotting for tryng to access a page that was specifically created for machine consumption. It is not that the website owner or publisher intend to block this. They are unaware that turng on Cloudflare will block everything, even stuff allowed to be consumed according to robots.txt .P.S. when I mentioned this here on HN a few weeks back, it was implied that I probably did not respect robots.txt ( I do, Cloudflair does not) or that I should get in touch with the site administrators (impossible to do in any reasonably effective way at scale).

drudru7 months ago

I noticed this a while back when I was trying to read cloudflare's own blog. Periodically they would block my newsreader. I ended up just dropping their feed.I am glad to see other people calling out the problem. Hopefully, a solution will emerge.

samplifier7 months ago

I've noticed that Old Reddit still supports RSS feeds without returning a 403 error. This is in contrast to the main site, which often blocks RSS requests.Here are some DNS details:The main Reddit site (www.reddit.com) uses Fastly. Old Reddit (old.reddit.com) also uses Fastly. However, the "vomit" address (which often returns 403s for RSS requests) uses AWS DNS. Is Old Reddit not behind Cloudflare, or is there another reason why it handles RSS requests differently?

评论 #41894289 未加载

est7 months ago

Hmmm, that's why "feedburner" is^H^Hwas a thing, right?We have come to full circle.

评论 #41868972 未加载

renewiltord7 months ago

Ah, the Cloudflare free plan does not automatically turn these on. I know since I use it for some small things and don't have these on. I wouldn't use User-Agent filtering because those are spoofable. But putting feeds on a separate URL is probably a good idea. Right now the feed is actually generated on request for these sites, so caching it is probably a good idea anyway. I can just rudimentarily do that by periodically generating and copying it over.

015a7 months ago

Suggesting that website operators should allowlist RSS clients through the Cloudflare bot detection system via their user-agent is a rather concerning recommendation.

soraminazuki7 months ago

This is an issue with techdirt.com. I contacted them about this through their feedback form a long time ago, but the issue still remains unfortunately.

nfriedly7 months ago

Liliputing.com had this problem a couple of years ago. I emailed the author and he got it sorted out after a bit of back and forth.

hwj7 months ago

I had problems accessing Cloudflare-hosted websites via the Tor browser also. Don't know it that is still true.

timnetworks7 months ago

RSS is the future that is being kept from us for twenty years already, fusion can kick bricks.

qwertyuiop_7 months ago

I have always suspected cloudflare being a classic intelligence community op. Just like Google was funded by qinetq

hkt7 months ago

It also manages to break IRC bots that do things like show the contents of the title tag when someone posts a link. Another cloudy annoyance, albeit a minor one.

3np7 months ago

Also: Sign in on gitlab.com is broken for me on Tor Browser because of an infinite "Verify you are human" refresh/redirect loop...

dewey7 months ago

I’m using Miniflix and I always run into that on a few blogs which now I just stopped reading.

shaunpud7 months ago

Namesilo are the same, their csv/rss behind Cloudflare so don't even bother anymore with their auctions and their own interface is meh

anilakar7 months ago

...and there is a good number of people who see this as a feature, not a bug.

idunnoman12227 months ago

Yes, the way to retain your privacy is to not use the Internetif you don’t like it, make your own Internet: assumedly one not funded by ads