Using HTTP's metadata headers is something we seem to have forgotten how to do.<p>The one that annoys me most is the Accept-Language header, which is almost entirely ignored in favour of GeoIP lookups to figure out regionality... which I find <i>super odd</i>, as if people are walking around using a browser in a language they don't speak (or an operating system configured for a language they don't speak).<p>ETags, though, are a bit fraught: if you're a company, a security scan will fire if an ETag is detected, because you <i>might</i> be able to figure out the inode on the filesystem from it... I don't know why that's a security problem either way[0], but false positives are common[1]... which makes people not respect the header.<p>Last-Modified should work, though. I love the idea of checking headers and not content.<p>I think people don't care to imagine the computer doing <i>as little as possible</i> to get the job done, and instead use the near-unlimited computing power to just avoid thinking about consequences.<p>[0]: <a href="https://www.pentestpartners.com/security-blog/vulnerabilities-that-arent-etag-headers/" rel="nofollow">https://www.pentestpartners.com/security-blog/vulnerabilitie...</a><p>[1]: <a href="https://github.com/sullo/nikto/issues/469">https://github.com/sullo/nikto/issues/469</a>
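For anyone writing a reader, honouring both validators takes only a few lines. A minimal sketch, assuming Python's requests library and that you cache the two header values between polls:<p><pre><code># First fetch: remember the validators the server hands back.
import requests

url = "https://rachelbythebay.com/w/atom.xml"
r = requests.get(url)
etag = r.headers.get("ETag")
last_modified = r.headers.get("Last-Modified")

# Next poll: echo them back so the server can answer 304.
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified

r2 = requests.get(url, headers=headers)
if r2.status_code == 304:
    pass  # nothing changed; nothing to download or parse
</code></pre>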
Bonus points for clients that don't support the HTTP “Accept-Encoding” header [1] and consume all your bandwidth.<p>[1] <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding" rel="nofollow">https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Ac...</a>
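On the client side this is one request header plus a decompress step. A sketch with Python's stdlib, which (unlike most higher-level HTTP libraries) won't negotiate or decompress for you; the URL is a placeholder:<p><pre><code>import gzip
import urllib.request

req = urllib.request.Request(
    "https://example.com/feed.xml",  # placeholder
    headers={"Accept-Encoding": "gzip"},
)
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    # Only decompress if the server actually honoured the header.
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
</code></pre>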
I always appreciate Rachel's writings. I don't know much about her, but my takeaway is that she has worked some of the hardest sysadmin jobs of the past few decades and writes from that experience super well.
IMO, it is also unreasonable to have ultra-restrictive rate limits, like blocking a client after one request.<p><a href="https://rachelbythebay.com/w/atom.xml" rel="nofollow">https://rachelbythebay.com/w/atom.xml</a>
Rachel makes an excellent point here about feed change frequency.<p>Seems like it'd be straightforward to build a backoff strategy into most readers based on how frequently the feed content changes. For a regular, periodic fetch, if the content has proven it doesn't update frequently, just back off the polling period for that endpoint.
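One possible shape for that, as an illustrative sketch (the interval bounds are made up): widen the interval every time a fetch comes back unchanged, and snap back to the minimum when new content appears.<p><pre><code>MIN_INTERVAL = 15 * 60       # 15 minutes (arbitrary)
MAX_INTERVAL = 24 * 60 * 60  # 1 day (arbitrary)

def next_interval(current: int, feed_changed: bool) -> int:
    """Return the number of seconds to wait before the next poll."""
    if feed_changed:
        return MIN_INTERVAL
    return min(current * 2, MAX_INTERVAL)
</code></pre>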
If-Modified-Since and ETag are nice and everyone should implement them, but IME the implementation status is much better on the reader side than on the feed side. Trim your (main) feed to only recent posts and use Atom's pagination to link to the rest for new subscribers, and the difference in data transferred becomes much smaller.<p>> Besides that, a well-behaved feed will have the same content as what you will get on the actual web site. The HTML might be slightly different to account for any number of failings in stupid feed readers in order to save the people using those programs from themselves, but the actual content should be the same. Given that, there's an important thing to take away from this: there is no reason to request every single $(<i>&^$</i>(&^@#* post that's mentioned in the feed.<p>> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!<p>Unfortunately there are too many feeds that <i>don't</i> include the full content for this to work. And a reader won't know whether the feed has the full content before fetching the HTML page. This can also change from post to post, so it can't just be determined once when subscribing.<p>> Then there are the user-agents who lie about who they are or where they are coming from because they think it's going to get them special treatment somehow.<p>These exist because of misbehaved web servers that block based on user agent or send different content. And since you are complaining about faked user agents, that probably includes you.<p>> Sending referrers which make no sense is just bad manners.<p>HTTP Referer should not exist; it has been abused by spammers for ages.
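On the pagination point above: a reader can backfill by walking the feed's rel="next" links (RFC 5005), e.g. with Python's feedparser; the URL is a placeholder:<p><pre><code>import feedparser

url = "https://example.com/feed.atom"  # placeholder
while url:
    d = feedparser.parse(url)
    for entry in d.entries:
        pass  # ingest the entry
    # Follow the next page, if the feed links to one.
    url = next((l.get("href") for l in d.feed.get("links", [])
                if l.get("rel") == "next"), None)
</code></pre>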
> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!<p>People probably do this because some sites only give you a preview in the feed, to force you to go to the site and view the ads.<p>So if you want the full post in the feed reader, you need to pull the post as well.
WebSub is your friend here: <a href="https://www.w3.org/TR/websub/" rel="nofollow">https://www.w3.org/TR/websub/</a><p>This adds a nice publish-subscribe model to RSS. Ping the WebSub server when there are changes; subscribing services are easily notified; nobody has to worry about excessive polling. Hooray.
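The publisher side can be as small as one form-encoded POST. The spec leaves the publisher-to-hub notification implementation-defined, but the conventional ping most hubs accept looks like this sketch (both URLs are placeholders):<p><pre><code>import requests

requests.post(
    "https://hub.example.com/",  # your WebSub hub (placeholder)
    data={
        "hub.mode": "publish",
        "hub.url": "https://example.com/feed.xml",  # the topic that changed
    },
)
</code></pre>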
RSS feeds have a TTL inside the feed. Do feed readers respect it?<p><a href="https://www.rssboard.org/rss-draft-1#element-channel-ttl" rel="nofollow">https://www.rssboard.org/rss-draft-1#element-channel-ttl</a>
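A reader that wanted to respect it could read it straight off the parsed feed. A sketch with Python's feedparser, noting that the ttl element is a string of minutes and absent from most feeds (the URL and fallback are made up):<p><pre><code>import feedparser

d = feedparser.parse("https://example.com/rss.xml")  # placeholder
ttl_minutes = int(d.feed.get("ttl", 60))  # arbitrary fallback
</code></pre>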
For a feed reader I built that year (2022), I polled each feed every second day, except the ones that had a new item within the last 50 days; those I polled every day.
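That schedule as a tiny illustrative function:<p><pre><code>from datetime import datetime, timedelta, timezone

def poll_interval(last_item: datetime) -> timedelta:
    # Daily if the feed had an item in the last 50 days, else every 2 days.
    recent = datetime.now(timezone.utc) - last_item < timedelta(days=50)
    return timedelta(days=1) if recent else timedelta(days=2)
</code></pre>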
I’m 100% sure there are many badly written, inefficient crawlers that waste both server resources and the resources of wherever they run, but I use feed readers a lot and it is very hard to find well-maintained feeds. Many servers also use cache-related headers incorrectly or don’t use them at all.
This is a good lesson on being a good citizen of the Internet.<p>It's easy to just curl a feed every second, but should you? (Of course not.)<p>Take it as a challenge to make your reader as fancy as possible: use every trick in the book to optimise how it fetches content. Analyse each feed's pattern of releasing new content and adjust the fetch frequency based on that.<p>And if you're building a reader for distribution, don't let the user set a refresh interval that doesn't make sense.
> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!<p>In some cases the reader <i>should</i> fetch both the feed and the pages. Unfortunately, none do.<p><a href="https://github.com/miniflux/v2/issues/3084">https://github.com/miniflux/v2/issues/3084</a>