A sysadmin's rant about feed readers and crawlers (2022)

57 points by leonry 3 months ago

16 comments

dijit 3 months ago

Using HTTP meta-headers is actually something we seem to have forgotten how to do.

The one that annoys me most is the Accept-Language header, which is almost entirely ignored in favour of GeoIP lookups to figure out regionality... which I find super odd; as if people are walking around using a browser in a language they don't speak (or an operating system configured for a language they don't speak).

ETags, though, are a bit fraught: if you're a company, a security scan will fire if an ETag is detected, because you might be able to figure out the inode on the filesystem based on it... which, idk why that's a security problem either way [0], but it's common for there to be false positives [1]... which makes people not respect the header.

Last-Modified should work though; I love the idea of checking headers and not content.

I think people don't care to imagine the computer doing as little as possible to get the job done, and instead use the near unlimited computing power to just avoid thinking about consequences.

[0]: https://www.pentestpartners.com/security-blog/vulnerabilities-that-arent-etag-headers/

[1]: https://github.com/sullo/nikto/issues/469
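
A minimal sketch of the "check headers, not content" idea, assuming a reader that stores the `ETag` and `Last-Modified` values from its previous fetch. The function names (`conditional_headers`, `fetch_if_changed`) are made up for illustration; only the standard `If-None-Match` / `If-Modified-Since` request headers and the 304 status come from HTTP itself.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build validators for a conditional GET from the previous response."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # urllib reports 304 as an HTTPError; nothing changed, so keep
            # the validators we already have and skip the download.
            return None, etag, last_modified
        raise
```

A reader would persist the two returned validators per feed and pass them back in on the next poll.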

gildas 3 months ago

Bonus points for clients that don't support the HTTP “Accept-Encoding” header [1] and consume all your bandwidth.

[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding
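
For illustration, a small sketch of a client that does advertise gzip support, assuming the server honours `Accept-Encoding`. `decode_body` and `fetch_compressed` are hypothetical helper names; `urllib` does not decompress responses on its own, so the client has to do it.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding):
    """Decompress the body if the server says it is gzip-encoded."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch_compressed(url):
    """Fetch a URL while telling the server that gzip is acceptable."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=30) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

Feed XML is highly repetitive, so it tends to compress well, which is why skipping this header costs the server so much bandwidth.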

tomrod 3 months ago

I always appreciate Rachel's writings. I don't know much about her, but my takeaway is that she has worked at some of the hardest sysadmin jobs in the past few decades and writes to her experience super well.

someothherguyy 3 months ago

IMO, it is also unreasonable to have ultra-restrictive rate limits, like blocking a client after one request.

https://rachelbythebay.com/w/atom.xml

Aeolun 3 months ago

160 gigabytes of feed over the course of a month (when polling a 640kb feed every 10 seconds), in case anyone else was wondering.
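
The arithmetic behind that figure can be checked with a few lines, assuming a 30-day month and decimal units (1 KB = 1000 bytes); both are assumptions, since the comment does not state them. Under those assumptions the total comes to roughly 166 GB, in line with the rounded figure above.

```python
def monthly_transfer_gb(feed_bytes, poll_seconds, days=30):
    """Total transfer in decimal gigabytes for unconditional polling."""
    polls = days * 24 * 60 * 60 // poll_seconds
    return polls * feed_bytes / 1e9


# 640 KB feed, fetched in full every 10 seconds for 30 days.
print(round(monthly_transfer_gb(640 * 1000, 10), 1))
```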

shadowgovt 3 months ago

Rachel makes an excellent point here about feed change frequency.

Seems like it'd be straightforward to build a backoff strategy into most readers, based on how frequently the feed content changes. For a regular, periodic fetch, if the content has proven it doesn't update frequently, just back off the period for that endpoint.
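
One way such a backoff could look, as a rough sketch: double the interval after a fetch that found nothing new, halve it after one that did, and clamp to sane bounds. The function name, the doubling/halving rule, and the one-hour and one-day limits are all assumptions chosen for illustration, not anything a particular reader does.

```python
def next_poll_interval(current, changed, minimum=3600, maximum=86400):
    """Return the next polling interval in seconds for one feed.

    current: the interval used for the fetch that just completed.
    changed: True if that fetch returned new content.
    """
    if changed:
        proposed = current // 2  # feed is active, check more often
    else:
        proposed = current * 2   # feed is quiet, back off
    return max(minimum, min(maximum, proposed))
```

A feed that stays quiet walks from hourly up to daily polling in five fetches, and drops back quickly once it starts posting again.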

account42 3 months ago

If-Modified-Since and ETag are nice and everyone should implement them, but IME the implementation status is much better on the reader side than on the feed side. Trim your (main) feed to only recent posts and use Atom's pagination to link to the rest for new subscribers, and the difference in data transferred becomes much smaller.

> Besides that, a well-behaved feed will have the same content as what you will get on the actual web site. The HTML might be slightly different to account for any number of failings in stupid feed readers in order to save the people using those programs from themselves, but the actual content should be the same. Given that, there's an important thing to take away from this: there is no reason to request every single $(&^$(&^@#* post that's mentioned in the feed.

> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!

Unfortunately there are too many feeds that don't include the full content for this to work. And a reader won't know if the feed has the full content before fetching the HTML page. This can also change from post to post, so it can't just determine this when subscribing.

> Then there are the user-agents who lie about who they are or where they are coming from because they think it's going to get them special treatment somehow.

These exist because of misbehaved web servers that block based on user agent or send different content. And since you are complaining about faked user agents, that probably includes you.

> Sending referrers which make no sense is just bad manners.

HTTP Referer should not exist. And has been abused by spammers for ages.

jstanley 3 months ago

> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!

People probably do this because some sites only give you a preview in the feed, to force you to go to the site and view the ads.

So if you want the full post in the feed reader, you need to pull the post as well.

benwerd 3 months ago

WebSub is your friend here: https://www.w3.org/TR/websub/

This adds a nice publish-subscribe model to RSS. Ping the WebSub server when there are changes; subscribing services are easily notified; nobody has to worry about excessive polling. Hooray.
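
As a rough sketch of the subscriber side: a WebSub subscription is requested by POSTing a form-encoded body with `hub.mode`, `hub.topic`, and `hub.callback` to the hub advertised by the feed. The helper name and the example URLs below are placeholders, and the sketch stops at building the body; a real subscriber also has to answer the hub's verification request at its callback URL.

```python
from urllib.parse import urlencode


def subscription_form(topic_url, callback_url, lease_seconds=None):
    """Build the form-encoded body of a WebSub subscription request."""
    fields = {
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed being subscribed to
        "hub.callback": callback_url,  # where the hub should deliver updates
    }
    if lease_seconds is not None:
        fields["hub.lease_seconds"] = str(lease_seconds)
    return urlencode(fields).encode("ascii")


# Placeholder URLs, for illustration only.
body = subscription_form(
    "https://example.com/feed.xml", "https://reader.example.net/websub"
)
```

The body would be sent with `Content-Type: application/x-www-form-urlencoded` to the hub URL.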

theandrewbailey 3 months ago

RSS feeds have a TTL inside the feed. Do feed readers respect it?

https://www.rssboard.org/rss-draft-1#element-channel-ttl
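
Respecting it takes very little code. A minimal sketch, assuming an RSS 2.0 document where `<ttl>` sits directly under `<channel>` and is expressed in minutes; the function name and the 60-minute fallback are assumptions.

```python
import xml.etree.ElementTree as ET


def channel_ttl_minutes(rss_text, default=60):
    """Return the channel's <ttl> in minutes, or default if absent or invalid."""
    root = ET.fromstring(rss_text)
    text = root.findtext("channel/ttl")
    if text is None:
        return default
    try:
        minutes = int(text.strip())
    except ValueError:
        return default
    return minutes if minutes > 0 else default
```

A reader could treat this value as a lower bound on its polling interval for that feed.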

croisillon 3 months ago

for a feed reader i built that year (2022) i was polling each feed every second day, except the ones which had a new item within 50 days -> every day
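
That rule fits in a few lines. A sketch under the assumption that the reader records the timestamp of the newest item seen per feed; the function and parameter names are invented here.

```python
from datetime import datetime, timedelta


def poll_interval(last_item_at, now, recent_window=timedelta(days=50)):
    """Poll daily if the feed posted within the window, else every second day."""
    if now - last_item_at <= recent_window:
        return timedelta(days=1)
    return timedelta(days=2)
```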

ffjffsfr 3 months ago

I’m 100% sure there are many badly written inefficient crawlers that are wasting server resources and resources where they run but I use feed readers a lot and it is very hard to find well maintained feeds. Many servers also use cache related headers incorrectly or don’t use them at all.

yapyap 3 months ago

https://web.archive.org/web/20241205224611/http://rachelbythebay.com/w/2022/03/07/get/

balamatom 3 months ago

So nice to see RSS making a comeback!

theshrike79 3 months ago

This is a good lesson on being a good citizen of the Internet.

It's easy to just curl a feed every second, but should you? (Of course not)

Take it as a challenge to make your reader as fancy as possible, use every trick in the book to optimise how it fetches content. Analyse the patterns of releasing new content per feed and adjust the fetch frequency based on that.

And if you're building a reader for distribution, don't let the user set a refresh interval that doesn't make sense.

internetter 3 months ago

> If you pull the feed, don't pull the posts. If you pull the posts, don't pull the feed. If you pull both, you're missing the whole point of having an aggregated feed!

In some cases the reader should fetch both the feed and the pages. Unfortunately, none do.

https://github.com/miniflux/v2/issues/3084