
When imperfect systems are good: Bluesky's lossy timelines

785 points by cyndunlop, 3 months ago

42 comments

pornel, 3 months ago
I wonder why timelines aren't implemented as a hybrid gather-scatter, choosing a strategy based on account popularity (a combination of fan-out to followers and a lazy fetch of popular followed accounts when a follower's timeline is served).

When you have a celebrity account, instead of fanning out every message to millions of followers' timelines, it would be cheaper to do nothing when the celebrity posts and, later, when serving each follower's timeline, fetch the celebrity's posts and merge them in. When millions of followers do that, it becomes a cheap read-only fetch from a hot cache.
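A minimal sketch of the hybrid approach described above, assuming hypothetical `store` helpers (`get_timeline_rows`, `get_recent_posts`, `insert_timeline_row`), a made-up celebrity cutoff, and that every source returns posts newest-first:

```python
import heapq

def fan_out(post, author, store, celebrity_threshold=100_000):
    """Write path: skip fan-out entirely for very popular accounts."""
    if len(author.followers) >= celebrity_threshold:
        return  # followers will gather these posts lazily at read time
    for follower_id in author.followers:
        store.insert_timeline_row(follower_id, post)

def read_timeline(user, store, limit=50):
    """Read path: merge the pre-materialized timeline with lazily fetched
    posts from celebrity accounts that were never fanned out on write."""
    # Rows written by the normal fan-out path, newest first.
    fanned_out = store.get_timeline_rows(user.id, limit=limit)

    # Celebrities were skipped at write time; fetch their recent posts now.
    # Millions of followers ask for the same few authors, so these reads
    # should land in a hot cache.
    celebrity_slices = [
        store.get_recent_posts(author_id, limit=limit)
        for author_id in user.followed_celebrities
    ]

    # Merge all newest-first sources into one reverse-chronological list.
    merged = heapq.merge(fanned_out, *celebrity_slices,
                         key=lambda post: post.created_at, reverse=True)
    return list(merged)[:limit]
```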
ChuckMcM, 3 months ago
As a systems enthusiast I enjoy articles like this. It is really easy to get into the mindset of "this must be perfect".

In the Blekko search engine back end we built an index that was 'eventually consistent', which allowed updates to the index to be propagated to the user-facing index more quickly, at the expense that two users doing the exact same query would get slightly different results. If they kept doing those same queries they would eventually get the exact same results.

Systems like this bring in a lot of control systems theory because they have the potential to oscillate if there is positive feedback (and in search engines that positive feedback comes from the ranker, which looks at which link you clicked and gives it a higher weight), and it is important that they not go crazy. Some of the most interesting, and most subtle, algorithm work was done keeping that system "critically damped" so that it would converge quickly.

Reading this description of how users' timelines are sharded, with the same sorts of feedback loops (in this case 'likes' or 'reposts'), sounds like a pretty interesting problem space to explore.
rakoo, 3 months ago
Ok, I'm curious: since this strategy sacrifices consistency, has anyone thought about something that is not full fan-out on reads or on writes?

Let's imagine something like this: instead of writing to every user's timeline, the post is written once for each shard containing at least one follower. This caps the fan-out at write time to hundreds of shards. At read time, getting the content for a given user reads that hot slice and filters for accounts the user actually follows. It definitely has more load, but:

- the read is still colocated inside the shard, so latency remains low

- for mega-followers the page will not see older entries anyway

There are of course other considerations, but I'm curious about what the load for something like that would look like (and I don't have the data nor infrastructure to test it).
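A rough sketch of this shard-level variant, with invented `shards`/`shard` helpers: the write happens once per shard that hosts at least one follower, and the read filters the shard-local hot slice down to accounts the reader actually follows.

```python
def fan_out_per_shard(post, author, shards, shard_of):
    """Write once per shard containing at least one follower, instead of
    once per follower; write amplification is capped at the shard count."""
    target_shards = {shard_of(follower_id) for follower_id in author.followers}
    for shard_id in target_shards:
        shards[shard_id].append_hot_row(author.id, post)

def read_timeline_from_shard(user, shard, limit=50):
    """Read the shard-local hot slice and keep only posts whose author
    this user follows (over-fetch, then filter)."""
    follows = set(user.follows)
    candidates = shard.get_hot_rows(limit=limit * 10)
    return [row for row in candidates if row.author_id in follows][:limit]
```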
dsauerbrun, 3 months ago
I'm a bit confused.

The lossy timeline solution basically means you skip updating the feed for some people who are above the reasonable number of follows. I get that.

Seeing them get 96% improvements is insane. Does that mean they have a ton of users following an unreasonable number of people, or do they just have a very low number for the reasonable limit? I doubt it's the latter, since that would mean a lot of people would be missing updates.

How is it possible to get such massive improvements when you're only skipping a presumably small % of people per new post?

EDIT: nvm, I rethought it. The issue is that a single user with millions of follows will constantly be written to, which slows down the fanout service when a celebrity makes a post since you're going through many db pages.
spoaceman7777, 3 months ago
Hmm. Twitter/X appears to do this at quite a low number, as the "Following" tab is incredibly lossy (some users are permanently missing) at only 1,200 followed people.

It's *insanely* frustrating.

Hopefully you're adjusting the lossy-ness weighting and cut-off by whether a user is active at any particular time? Because otherwise, applying this rule with the cap set too low makes for a very bad UX in my experience x_x
rconti, 3 months ago
> Additionally, beyond this point, it is reasonable for us to not necessarily have a perfect chronology of everything posted by the many thousands of users they follow, but provide enough content that the Timeline always has something new.

While I'm fine with the solution, the wording of this sentence led me to believe that the solution was going to be imperfect chronology, not dropped posts in your feed.
jadbox, 3 months ago
So, let's say I follow 4k people in the example and have a 50% drop rate. It seems a bit weird that if all (4k - 1) accounts I follow end up posting nothing in a day, I STILL have a 50% chance that I won't see the one account that does post that day. It seems to me that the algorithm should consider my feed's age (or the post freshness of the accounts I follow). Am I overthinking?
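Taking the numbers in this comment at face value (4k follows giving a 50% drop rate), the write probability would scale roughly as reasonable_limit / follows. Both the limit value and the exact formula below are assumptions derived from that example, not the article's published numbers:

```python
import random

REASONABLE_LIMIT = 2_000  # assumed; chosen so that 4,000 follows gives a 50% drop rate

def should_write(num_follows: int) -> bool:
    """Decide whether a new post gets written into this follower's timeline.
    At or below the limit every write happens; above it, the write survives
    with probability REASONABLE_LIMIT / num_follows."""
    if num_follows <= REASONABLE_LIMIT:
        return True
    return random.random() < REASONABLE_LIMIT / num_follows
```

Note that this probability depends only on how many accounts the user follows, not on how often those accounts post, which is exactly the oddity the comment raises.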
knallfrosch, 3 months ago
Anyone following hundreds of thousands of users is obviously a bot account scraping content. I'd ban them and call it a day.

However, I do love reading about the technical challenge. I think Twitter has a special architecture for celebrities with millions of followers. Given Bluesky is a quasi-clone, I wonder why they did not follow in those footsteps.
sphars, 3 months ago
When I go directly to a user's profile and see all their posts, sometimes one of their posts isn't in my timeline where it should be. I follow fewer than 100 users on Bluesky, but I guess this explains why I occasionally don't see a user's post in my timeline.

Lossy indeed.
cavisne, 3 months ago
AWS has a cool general approach to this problem (one badly behaving user affecting others on their shard):

https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding/

The basic idea is to assign each user to multiple shards, decreasing the chances of another user sharing *all* their shards with the badly behaving user.

Fixing this issue as described in the article makes sense, but if they did shuffle sharding in the first place it would cover any new issues without affecting many other users.
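A small sketch of the shuffle-sharding idea from the linked AWS article; the total shard count and per-user shard budget are made-up parameters:

```python
import random

NUM_SHARDS = 128      # total shard pool (assumed)
SHARDS_PER_USER = 2   # virtual shards assigned to each user (assumed)

def shuffle_shards(user_id: str) -> list[int]:
    """Deterministically assign each user its own small combination of shards.
    A noisy neighbour only degrades users who share *every* shard with it,
    and with C(128, 2) = 8,128 possible pairs that full overlap is rare."""
    rng = random.Random(user_id)  # seeded per user, so the assignment is stable
    return rng.sample(range(NUM_SHARDS), SHARDS_PER_USER)
```

A timeline read or write can then prefer whichever of the user's assigned shards is currently healthy.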
ultra-boss, 3 months ago
Love reading these sorts of "technical problem + solution" pieces. The world does not need more content, in general, but it does need more of this kind of quality information sharing.
artee_49, 3 months ago
I am a bit perplexed, though, as to why they implemented fan-out in a way where each "page" blocks the fetching of further pages; they would not have been affected by the high tail latencies if they had not done this:

"In the case of timelines, each “page” of followers is 10,000 users large and each “page” must be fanned out before we fetch the next page. This means that our slowest writes will hold up the fetching and Fanout of the next page."

This basically means they block on each page, process all the items on the page, and then move on to the next page. Why wouldn't you rather decouple the page fetcher from the processing of the pages?

A page-fetching activity should be able to continuously keep fetching further sets of followers one after another and should not wait for each of the items in a page to be updated before continuing.

Something that comes to mind would be a fetcher component that fetches pages, stores each page in S3, and publishes the metadata (content) and the S3 location to a queue (SQS) that can be consumed by timeline publishers, which can scale independently based on load. You can control the concurrency in this system much better, and you could also partition based on the shards with another system like Kafka by using the shards as keys in the queue to "slow down" the work without having to effectively drop tweets from timelines (timelines are eventually consistent regardless).

I feel like I'm missing something and there's a valid reason to do it this way.
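A toy version of the decoupling described above, using an in-process asyncio queue instead of S3/SQS; the `store` methods are invented stand-ins for the real follower and timeline stores:

```python
import asyncio

async def fetch_pages(author_id, queue, store, page_size=10_000):
    """Producer: page through followers and enqueue each page without
    waiting for the previous page's timeline writes to finish."""
    cursor = None
    while True:
        page, cursor = await store.get_follower_page(author_id, cursor, page_size)
        if not page:
            break
        await queue.put(page)      # bounded queue provides back-pressure
    await queue.put(None)          # sentinel: no more pages

async def write_pages(post, queue, store, concurrency=64):
    """Consumer: fan the post out to each page of followers with many
    writes in flight, so one slow write does not stall the whole page."""
    sem = asyncio.Semaphore(concurrency)

    async def write_one(follower_id):
        async with sem:
            await store.insert_timeline_row(follower_id, post)

    while (page := await queue.get()) is not None:
        await asyncio.gather(*(write_one(f) for f in page))

async def fan_out(post, author_id, store):
    queue = asyncio.Queue(maxsize=4)   # keep a few pages buffered ahead
    await asyncio.gather(
        fetch_pages(author_id, queue, store),
        write_pages(post, queue, store),
    )
```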
ramblejam, 3 months ago
Nice problem to have, though. Over on Nostr they're finding it a real struggle to get to the point where you're confident you won't miss replies to your own notes, let alone replies from other people in threads you haven't interacted with.

The current solution is for everyone to use the same few relays, which is basically a polite nod to Bluesky's architecture. The long-term solution involves a lot of relay hint-dropping and a reliance on Japanese levels of acuity when it comes to picking up on hints (among clients). But (a) it's proving extremely slow going and (b) it only aims to mitigate the "global as relates to me" problem.
arcastroe, 3 months ago
I found it odd to base the loss factor on the number of people you follow, rather than a truer indication of timeline-update frequency. What if I follow 4k accounts, but each of those accounts only posts once a decade? My timeline would become unnecessarily lossy.
NoGravitas, 3 months ago
The funny thing is that all of the centralization in Bluesky is defended as necessary to provide things like global search and all replies in a thread, things that Mastodon simply punts on in the name of decentralization. But then, ultimately, Bluesky has to relax those goals after all.
skybrian, 3 months ago
This design makes sense if you didn’t previously have any limit on the number of people an account could follow. But why not have a limit?
nasso_dev, 3 months ago
Interesting! I wonder what value they chose for the `reasonable_limit`.
inportb, 3 months ago
An interesting solution to a challenging problem. Thank you for sharing it.

I must admit, I had some trouble following the author's transition from "celebrity" with many followers to "bot" with many follows. While I assume the work done for a celebrity to scatter a bunch of posts would be symmetric to the work done for a commensurate bot to gather a bunch of posts, I had the impression that the author was introducing an entirely different concept in "Lossy Timelines."
thmrtz, 3 months ago
That's quite interesting and a challenge I had not thought of. I understand the need for a solution and I believe this works reasonably well, but I am wondering what happens to users that follow a lot of accounts with below-average activity. This may naturally happen on new social media platforms, with people trying out the service and possibly abandoning it.

The "reasonable limit" is likely set to account for such an effect, but I wonder whether a per-user limit based on the activity of the accounts one follows would be an improvement on this approach.
fastest963, 3 months ago
To help avoid the hot-shard problem, I wonder how capping follows per "timeline" would perform. Specifically, each user would have a separate timeline per 1,000 followed accounts and the client would merge them. You could still do the lossy part, if necessary, by only loading a percentage of the actual timelines. That wouldn't help the celebrity problem, but it was already acknowledged earlier that the solution to that is to not fan out celebrity accounts.
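A sketch of what that read path might look like, with an invented `store` helper: the user's follows are spread across fixed-size timeline partitions, and the client merges them, optionally sampling only a fraction of the partitions for the lossy variant:

```python
import heapq
import math
import random

FOLLOWS_PER_PARTITION = 1_000  # partition size suggested in the comment

def read_merged_timeline(user, store, limit=50, sample_fraction=1.0):
    """Fetch each timeline partition (or a random sample of them) and
    merge the newest-first slices into one feed."""
    num_partitions = max(1, math.ceil(len(user.follows) / FOLLOWS_PER_PARTITION))
    partitions = list(range(num_partitions))
    if sample_fraction < 1.0:   # lossy variant: skip some partitions entirely
        k = max(1, int(len(partitions) * sample_fraction))
        partitions = random.sample(partitions, k)
    slices = [store.get_partition_rows(user.id, p, limit=limit) for p in partitions]
    merged = heapq.merge(*slices, key=lambda row: row.created_at, reverse=True)
    return list(merged)[:limit]
```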
Artoooooor, 3 months ago
Are users informed that they follow too many creators and now they will not see every post on their timelines?
buxidao, 3 months ago
In the fanout design, why not dynamically move on to the next 10,000-user page as soon as all tasks for the current page are either queued or processing? Would that approach improve throughput, or could it introduce issues like resource contention?
trhway, 3 months ago
So the system design puts the burden on what seems to be synchronous, not queued, writes to get easy reads. I usually prefer simpler cheaper writes at the cost of more complicated reads as the reads scale and parallelize better.
crabbone, 3 months ago
Anecdotally, I ran into a similar solution "by chance".

Long ago, I worked for a dating site. Our CTO at the time was a "guest of honor" who was brought in by a family friend working in marketing at the time. The CTO was a university professor who took the job as a courtesy (he didn't need the money or the fame, he had enough of both, and actually liked teaching).

But he instituted a lot of experimental practices in the company, such as switching roles every now and then (anyone in the company could apply for a different role except administration and try wearing a different hat), or having company-wide discussions of problems where employees would prepare a presentation on their current work (that was very unusual at the time, but the practice became more institutional in larger companies afterwards).

Once he announced a contest for a problem he was trying to solve. Since we were building a dating site, the obvious problem was matching. The problem was that the more properties there were to match on, the longer it would take (besides other problems). So the program was punishing site users who took the time to fill out the questionnaires as well as they could and favored the "slackers".

I didn't have any bright ideas on how to optimize the matching / search for matches. So, ironically, I asked, "what if we just threw away properties beyond a certain threshold, randomly?" I was surprised that my idea received any traction at all. And the answer was along the lines of "that would definitely work, but I wouldn't know how to explain this behavior to the users". Which, at the time, I took to be yet another eccentricity of the old man... but hey, the idea stuck with me for a long time!
flaburgan, 3 months ago
The solution to this problem is known and implemented already: the social web should be distributed between thousands of pods, each containing at most a few thousand users. Diaspora has already worked like this for 15 years. It is technically harder to build initially, but it then divides up all the problems (maintenance, moderation, load, censorship, trust of the owner...), which makes the network much more resilient. Bluesky knows that, and they allow other people to host their software, but they are really not pushing for it, and I highly doubt that the experience of a user on a small external pod is the same as on bluesky.com.
mpweiher, 3 months ago
On a related note, I am pretty confident that one of the main reasons the WWW succeeded where previous attempts failed was that it very specifically allowed 404s.
KolmogorovComp, 3 months ago
A simpler option is to put a limit on the number of accounts one can follow. Who needs to follow more than 4k accounts if not bots?
udioron, 3 months ago
> some of them will do abnormal things like… well… following hundreds of thousands of other users.

Sounds like Bluesky Pro.
yibg, 3 months ago
I think something like this was an FB engineering interview question (several years ago), just for Instagram feeds.
JadeNB, 3 months ago
I understand that it's a different point, but how can someone write a whole essay called "When imperfect systems are good" without once mentioning Gabriel or https://en.wikipedia.org/wiki/Worse_is_better?
robbale, 3 months ago
The combination of fan-out to followers and a lazy fetch of popular followed accounts when a follower's timeline is served is a good implementation for hot-reload scenarios.
dtonon, 3 months ago
The typical problem of a centralized infrastructure.

Indeed:

> This means each user gets their own Timeline partition, randomly distributed among shards of our horizontally scalable database (ScyllaDB), replicated across multiple shards for high availability
Nemo_bis, 3 months ago
"Lossy timelines" have already been implemented in ActivityPub and Mastodon by design. Will Bluesky ever catch up? It remains to be seen.
andsoitis, 3 months ago
Principle: Progress over perfection.
nightpool, 3 months ago
Note that all of this reflects design decisions in Bluesky's closed-source "AppView" server: any federated servers interacting with Bluesky would need to construct their own timelines, and do not get the benefit of the work described here.
timewizard, 3 months ago
> This process involves looking up all of your followers, then inserting a new row into each of their Timeline tables in reverse chronological order with a reference to your post.

Seriously? Isn't this the nut of your problem right here?
PaulHoule, 3 months ago
An airline reservation system has to be perfect (no slack in today's skies); a hotel reservation system can be 98% perfect so long as there is some slack and you don't mind putting somebody up in a better room than they paid for from time to time.

A social media system doesn't need to be perfect at all. It was clear to me from the beginning that Bluesky's feeds aren't very fast; not that they are crazy slow, but if it saves money or effort it's no problem if notifications are delayed 30s.
bitmasher9, 3 months ago
It’s really impressive how well Bluesky is performing. It really feels like a throwback to older social media platforms with its simplicity and lack of dark-patterns. I’m concerned that all the great work on the platform, protocol, etc won’t shine in the long term as they eventually need to find a revenue source.
mifydev, 3 months ago
"Hot Shards in Your Area" - 10/10 heading
dang, 3 months ago
[stub for offtopicness]
cush, 3 months ago
"Hot Shards in Your Area"... I died
alexnewman, 3 months ago
I don't see much call for Bluesky anymore...