I've been thinking about this a lot - nearly every problem these days is a synchronisation problem. You're regularly downloading something from an API? That's a sync. You've got a distributed database? Sync problem. Cache invalidation? Basically a sync problem. You want online and offline functionality? Sync problem. Collaborative editing? Sync problem.<p>And 'synchronisation' as a practice gets very little attention or discussion. People just start with naive approaches like 'download what's marked as changed' and then get stuck in the quagmire of known problems and known edge cases (handling deletions, handling transport errors, handling changes that didn't get marked with a timestamp, how to repair after a bad sync, dealing with conflicting updates, etc).<p>The one piece of discussion or attempt at a systematic approach to 'synchronisation' I've seen recently is Conflict-free Replicated Data Types <a href="https://crdt.tech" rel="nofollow">https://crdt.tech</a>, which is essentially restricting your data and the rules for dealing with conflicts to situations that are known to be resolvable, and then packaging it all up into an object.
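<p>To make that concrete, a last-writer-wins register is about the simplest CRDT there is. A toy sketch in TypeScript (not any particular library's API):<p><pre><code> // Toy last-writer-wins (LWW) register: the classic "hello world" CRDT.
 // Merging is commutative, associative and idempotent, so two replicas
 // that exchange states in any order converge to the same value.
 type LWW&lt;T&gt; = { value: T; timestamp: number; replicaId: string };

 function merge&lt;T&gt;(a: LWW&lt;T&gt;, b: LWW&lt;T&gt;): LWW&lt;T&gt; {
   if (a.timestamp !== b.timestamp) return a.timestamp > b.timestamp ? a : b;
   return a.replicaId > b.replicaId ? a : b; // deterministic tie-break
 }

 // Both replicas end up with the same title, no matter who syncs first.
 const fromPhone: LWW&lt;string&gt; = { value: "Draft v2", timestamp: 1700000002, replicaId: "phone" };
 const fromLaptop: LWW&lt;string&gt; = { value: "Draft v1", timestamp: 1700000001, replicaId: "laptop" };
 console.log(merge(fromPhone, fromLaptop).value); // "Draft v2"
 console.log(merge(fromLaptop, fromPhone).value); // "Draft v2"
</code></pre>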
I'm not convinced there is one generalised solution to sync engines. To make them truly performant at large scale, engineers need a deep understanding of the underlying technology (query performance, the database, networking) and have to build a custom sync engine around their product and their data.<p>Abstracting all of this complexity away into one general tool/library and pretending it will always work is snake oil. There are no shortcuts to building a truly high-quality product at large scale.
> <i>decoupled from the horrors of an unreliable network</i><p>The first rule of network transparency is: the network is not transparent.<p>> <i>Or: I’ve yet to see a code base that has maintained a separate in-memory index for data they are querying</i><p>Is boost::multi_index_container no longer a thing?<p>Also there's SQLite with the :memory: database.<p>And this ancient 4GL we use at work has in-memory tables (as in database tables, with typed columns and any number of unique or non-unique indexes) as a basic language feature.
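<p>And you don't even need a library for the basic case; a hand-rolled secondary index is just a Map you keep in step with your writes. A toy sketch (made-up names, not from any real code base):<p><pre><code> // A "separate in-memory index": records live in one structure,
 // and a Map gives O(1) lookup by a secondary key.
 interface User { id: number; email: string; name: string }

 const users = new Map&lt;number, User&gt;();   // primary storage, keyed by id
 const byEmail = new Map&lt;string, User&gt;(); // secondary index, keyed by email

 function upsert(user: User): void {
   const old = users.get(user.id);
   if (old) byEmail.delete(old.email);     // keep the index consistent on update
   users.set(user.id, user);
   byEmail.set(user.email, user);
 }

 upsert({ id: 1, email: "ada@example.com", name: "Ada" });
 console.log(byEmail.get("ada@example.com")?.name); // "Ada"
</code></pre>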
> have a theory that every major technology shift happened when one part of the stack collapsed with another.<p>If that were true, we would ultimately end up with a single layer. Instead I would say that major shifts happen when we move the boundaries between layers.<p>The author here proposes to replace servers with synced client-side data stores.<p>That is certainly a good idea for some applications, but it also comes with drawbacks. For example, it would be easier to avoid stale data, but it would be harder to enforce permissions.
Honourable mentions of some more excellent fully open-source sync engines:<p>- Zero Sync: <a href="https://github.com/rocicorp/mono" rel="nofollow">https://github.com/rocicorp/mono</a><p>- Triplit: <a href="https://github.com/aspen-cloud/triplit" rel="nofollow">https://github.com/aspen-cloud/triplit</a>
> decoupled from the horrors of an unreliable network<p>There's no such thing as a reliable network. The world is network-connected, and there have been almost no local-only systems for a long, long time now.<p>Some engineers dream that there are cases where the network is reliable, like when a system lives entirely within the same region and a single AZ. But even then it's not actually reliable and can glitch fairly frequently (like once a month or so, depending on your luck).
Locally synced databases seem to be a new trend. Another example is Turso, which works by maintaining a sort of SQLite-DB-per-tenant architecture. Couple that with WASM and we’ve basically come full circle back to old school desktop apps (albeit with sync-on-load). Fat client thin client blah blah.
Lotus Notes was a product far ahead of its time (nearly forgotten today): an object database with synchronization semantics. They made a lot of decisions that seem really strange today, like building an email system around it, but that empowered it for long-running business workflows. It's something everybody in the low-code/no-code space really needs to think about.
This is also a tricky UI problem. Live updates, where web pages move around on you while you’re reading them, aren’t always desirable. When you’re collaborating with someone you know on the same document, you want to see edits immediately, but what about a web forum? Do you really need to see the newest responses, or is this a distraction? You might want a simple indicator that a reload will show a change, though.<p>A white paper showing how Instant solves synchronization problems might be nice.
I'm surprised to see Tonsky here.<p>Mostly because I consider the state of the art on this to be Clojure Electric, and he's presumably aware of it to at least some degree but doesn't mention it.
Have been using Instant for a few side projects recently and it has been a phenomenal experience. 10/10, would build with it again. I suspect this is also at least partially true of client-server sync engines in general.
If sync really is the future, do you think devs will finally stop pretending local-first apps are some niche thing and start building around sync as the core instead of the afterthought? Or are we doomed to another decade of shitty conflict resolution hacks?
IPFS is a technology very helpful for syncing. One way it's being used in a modern context (although only sub-parts of the IPFS stack) is how BlueSky engineers, during their design process a few years ago, accepted my proposal that for a new social media protocol, each user should have his own "Repository" (basically a Merkle tree) of everything he's ever posted. Then there's just a "sync" up to some master service provider node (decentralized set of nodes/servers) for the rest of the world to consume.<p>Merkle-tree-based syncing is as performant as you can possibly get (used by the Git protocol too, I believe) because you can tell if the root of a tree structure is identical to some other remote tree structure just by comparing the hash strings. And this can be applied recursively down any "changed branches" of a tree to implement very fast syncing mechanisms.<p>I think we need a NEW INTERNET (i.e. Web3, and dare I say Semantic Web built in) where everyone's basically got their own personal "Tree of Stuff" they can publish to the world, all natively built into some new kind of tree-structure-based killer app. Like imagine having Jupyter Notebooks in tree form, where everything on it (that you want to be) is published to the web.
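<p>Roughly, the diffing works like this. A toy sketch of Merkle-tree comparison (not Bluesky's or Git's actual wire format):<p><pre><code> import { createHash } from "node:crypto";

 // Each node's hash covers its own content plus its children's hashes,
 // so equal root hashes imply the whole subtrees are identical.
 interface TreeNode { content: string; children: TreeNode[] }

 function hashOf(node: TreeNode): string {
   const h = createHash("sha256");
   h.update(node.content);
   for (const child of node.children) h.update(hashOf(child));
   return h.digest("hex");
 }

 // Walk both trees, descending only into branches whose hashes differ.
 function changedPaths(local: TreeNode, remote: TreeNode, path = "/"): string[] {
   if (hashOf(local) === hashOf(remote)) return [];   // identical subtree: skip it all
   const diffs: string[] = local.content !== remote.content ? [path] : [];
   local.children.forEach((child, i) => {
     const other = remote.children[i];
     if (other) diffs.push(...changedPaths(child, other, `${path}${i}/`));
     else diffs.push(`${path}${i}/`);                  // branch missing remotely
   });
   return diffs;
 }
</code></pre>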
Discussion of sync engines typically goes hand in hand with local-first software. But it seems to be limited to use cases when the amount of data is on the smaller side. For example, can anyone imagine how there might be a local-first version of a recommendation algorithm (I'm thinking something TikTok-esque)? This would be a case where the determination of the recommendation relies on a large amount of data.<p>Or think about any kind of large-ish scale enterprise SaaS. One of the clients I'm working with currently sells a Transportation Management Software system (think logistics, truck loads, etc). There are very small portions of the app that I can imagine relying on a sync engine, but being able to search over hundreds of thousands of truck loads, their contents, drivers, etc seems like it would be infeasible to do via a sync engine.<p>I mention this because it seems that sync engines get a lot of hype and interest these days, but they apply to a relatively small subset of applications. Which may still be a lot, but it's a bit much to say they're the future (I'm inferring "of application development"--which is what I'm getting from this article).
<i>> Such a library would be called a database. But we’re used to thinking of a database as something server-related, a big box that runs in a data center. It doesn’t have to be like that! Databases have two parts: a place where data is stored and a place where data is delivered. That second part is usually missing.</i><p>Yes! A thousand times this!<p>Databases can't just "live on a server somewhere"; their code should <i>extend</i> into the clients. The client isn't just a network protocol parser/serialiser: it should implement what is essentially an untrusted, read-only replica. For writes, it should implement what is essentially a local write-ahead log (WAL), kept in memory and optionally fsync'd to local storage. All of this should use the <i>same</i> codebase as the database engine, or be machine-generated in multiple languages from some sort of formal specification.
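<p>A minimal sketch of what that client side could look like (made-up names, in-memory only; the fsync and the shared codebase are left as an exercise):<p><pre><code> // Client = read-only replica + local write-ahead log of pending changes.
 type Op = { id: string; table: string; row: Record&lt;string, unknown&gt; };

 class ClientReplica {
   private rows = new Map&lt;string, Record&lt;string, unknown&gt;&gt;(); // last state pushed by the server
   private wal: Op[] = [];                                     // local writes not yet acknowledged

   applyServerUpdate(key: string, row: Record&lt;string, unknown&gt;): void {
     this.rows.set(key, row);                                  // authoritative data flows one way
   }

   writeLocal(op: Op): void {
     this.wal.push(op);                                        // optimistic local write
   }

   // Reads see server state overlaid with unacknowledged local writes.
   read(key: string): Record&lt;string, unknown&gt; | undefined {
     const pending = this.wal.find(op => op.id === key);
     return pending ? pending.row : this.rows.get(key);
   }

   async flush(send: (ops: Op[]) => Promise&lt;void&gt;): Promise&lt;void&gt; {
     if (this.wal.length === 0) return;
     await send(this.wal);                                     // server validates and re-broadcasts
     this.wal = [];                                            // drop the log once acknowledged
   }
 }
</code></pre>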
The problem I have with "moving the database to the client" is the same one I have in practice with CRDTs: in my apps, I need to preserve the history of changes to documents, and I need to validate and authenticate based on high-level change descriptions, not low-level DB access.<p>This always leads me back to operational transforms. Because operations are reified changes, they function as undo records; as a log of changes; and as a narrower, semantically meaningful API, amenable to validation and authz.<p>For the Roam Firebase example: this only works if you can either trust the client to always perform valid actions, or fully validate with Firebase's security rules.<p>OT has its critiques, but in my experience almost all of them fall away when you have a star topology with a central service that mediates everything: defining the canonical order of operations, performing validation & auth, and recording the operation log.
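<p>With the star topology the hard part is almost boring; the server just validates, sequences and logs. A sketch (with a hypothetical validate() standing in for your authz rules, and the transform step omitted):<p><pre><code> // Central mediator: every operation gets validated, given a canonical
 // sequence number, appended to the log, and fanned out to clients.
 interface Operation { clientId: string; kind: string; payload: unknown }
 interface LoggedOp extends Operation { seq: number }

 const log: LoggedOp[] = [];
 const subscribers = new Set&lt;(op: LoggedOp) => void&gt;();

 // Hypothetical high-level validation: checks the *semantic* change, not raw rows.
 function validate(op: Operation): boolean {
   return op.kind === "renameDocument" || op.kind === "insertParagraph";
 }

 function submit(op: Operation): LoggedOp | null {
   if (!validate(op)) return null;             // reject before it ever enters history
   const logged: LoggedOp = { ...op, seq: log.length };
   log.push(logged);                           // the log *is* the canonical order
   subscribers.forEach(notify => notify(logged));
   return logged;
 }
</code></pre>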
The largest feature my team develops is a sync engine. We have a distributed speech assistant app (multiple embedded instances [think car and smartphone] & cloud) that utilizes the Blackboard pattern. The sync engine keeps the blackboards on all instances in sync.<p>It is based on gRPC and uses a state machine on all instances that transitions through different states for connection setup, "bulk sync", "live sync" and connection wind-down.<p>Bulk sync is the state used when an instance comes online and needs to catch up on any missed changes. It is also the self-heal mechanism if something goes wrong.<p>Unfortunately some embedded instances have super unreliable clocks that drift quite a bit (in both directions). We're considering switching to a logical clock.<p>We have quite a bit of code that deals with conflicts.<p>I inherited this from my predecessor. Nowadays I would probably not implement something like this again, as it is quite complex.
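<p>For what it's worth, the logical-clock part is tiny compared to the rest; a Lamport clock is roughly this (a sketch, not our production code):<p><pre><code> // Lamport clock: monotonic counter that never goes backwards, even when
 // the wall clock does. Ordering is (time, nodeId) to break ties.
 class LamportClock {
   private time = 0;
   constructor(private readonly nodeId: string) {}

   tick(): number {                        // call before every local event
     return ++this.time;
   }

   observe(remoteTime: number): number {   // call when receiving a remote event
     this.time = Math.max(this.time, remoteTime) + 1;
     return this.time;
   }

   stamp(): { time: number; nodeId: string } {
     return { time: this.time, nodeId: this.nodeId };
   }
 }
</code></pre>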
I designed the sync engine for Things Cloud [0] over a decade ago. It seems to have worked out pretty well for them. (The linked page has some details about what it can do.)<p>When sync Just Works™, it's a magical thing.<p>One of the reasons my design has been reliable from its very first release, even across multiple refactors/rewrites (I believe it's currently on its third, this time in Swift), is that it uses a Git-like model internally with pervasive hashing. It's almost impossible for sync to work incorrectly (if it works at all).<p>[0] <a href="https://culturedcode.com/things/cloud/" rel="nofollow">https://culturedcode.com/things/cloud/</a>
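<p>The "pervasive hashing" idea, in general terms: every synced item is identified by a hash of its content plus its parent, so divergence or corruption is detectable rather than silent. A sketch of the pattern (not Things' actual design):<p><pre><code> import { createHash } from "node:crypto";

 // Content-addressed history item: the id *is* the hash of the payload
 // plus the parent id, so a chain of changes verifies itself like Git commits.
 interface HistoryItem { parent: string | null; payload: string }

 function idOf(item: HistoryItem): string {
   return createHash("sha256")
     .update(item.parent ?? "")
     .update(item.payload)
     .digest("hex");
 }

 // A receiver can re-hash everything it was sent; any tampering,
 // truncation or misordering changes the final id and is rejected.
 function verifyChain(items: HistoryItem[], expectedHead: string): boolean {
   let head: string | null = null;
   for (const item of items) {
     if (item.parent !== head) return false;
     head = idOf(item);
   }
   return head === expectedHead;
 }
</code></pre>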
Probably a silly question, but if you take this all the way and treat everything as a DB that is synchronized in the background, how do you manage access control where not every user/client is supposed to have access to every object represented in the DB? Where does that logic go?
If you do it at the document level like Figma or canvas apps, every document is a DB and you sync the changes that happen to the document, but first you need access to the document/DB. But doesn't this whole idea break apart if you need to do access control on individual parts of what you treat as the DB? You would need to have that logic on the client, which could never be secure...
How do sync engines address cases where we need something more dynamic? Currently I'm building a language learning app and we need to display your "learning path": what lessons you have finished and what your next lessons are. The next lessons aren't fixed/the same for everyone; they change depending on the scores of completed lessons. Is any query language dynamic enough to support use cases like this? Or is it expected that you recalculate the next lessons whenever the user completes a lesson and write them out to a table which can then be queried easily?
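<p>My impression is that the usual answer is the second option: treat the learning path as derived data that you recompute on write and store in an ordinary synced table. A sketch of that pattern (made-up scoring rule, hypothetical db.upsert standing in for whatever your backend's write API is):<p><pre><code> // Derived-data pattern: recompute the learning path whenever a lesson is
 // completed, and write the result to a table the sync engine replicates as-is.
 interface Completion { lessonId: string; score: number }

 // Stand-in for the backend's write API; the sync engine fans the row out to clients.
 const db = { upsert: async (_table: string, _row: unknown): Promise&lt;void&gt; =&gt; {} };

 function nextLessons(completed: Completion[], allLessons: string[]): string[] {
   const done = new Set(completed.map(c => c.lessonId));
   const needsReview = completed.filter(c => c.score &lt; 0.6).map(c => c.lessonId);
   const unseen = allLessons.filter(id => !done.has(id));
   return [...needsReview, ...unseen].slice(0, 3);   // made-up rule: weak lessons first
 }

 async function onLessonCompleted(userId: string, completed: Completion[], all: string[]) {
   const path = nextLessons(completed, all);
   await db.upsert("learning_path", { userId, path }); // clients only ever *read* this table
 }
</code></pre>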
So... what do people who want sync engines actually do?<p>I want to try it for a hobby project and I think I will go the route of just one-way sync (from database to clients) using ElectricSQL, and have writes done in the traditional way (POST requests).<p>I like the idea of having the server DB and local DB in sync, but what happens with writes? I know people say CRDTs etc... but they solve conflicts in unintuitive ways...<p>I know I probably sound uneducated, but I think the biggest part of this is still solving conflicts in a good way, and I don't really see how you can solve those in a way that works for all the different domains and have it "collapsed" as the author says
Maybe I am just dumb, but I really cannot see how data sync could solve what (in my kind of business) is a real problem.<p>Example: you develop a web app to book flights online.<p>My browser points to it and I log in.
Should synchronization start right now? Before I even input my departure point and date?<p>Ok, no. I write NYC -> BER, and a departure date.<p>Should I start syncing now?<p>Let's say I do. Is this really more efficient than querying a web service?<p>Ok, now all the data is synced. Even, potentially, the data for business class, even if I just need economy.<p>You know, I could always change my mind later. Or find out that on the day I need to travel no economy seats are available anymore.<p>Whatever. I have all the inventory data that I need. Raw.<p>Guess what? As a LH frequent flyer I get special treatment in terms of price. Not just for LH, but for most Star Alliance airlines.<p>This logic is usually on the server, because airlines want maximum creativity and flexibility in handling inventory.<p>Should we just sync data and make the offer selection algorithm run on the web server instead?<p>Let's say it does not matter... I somehow have in front of me all the options for my trip. So I call my wife to confirm she agrees with my choice. I explain the alternatives to her... this takes 5 minutes.<p>In this period, 367 other people are buying/cancelling trips to Europe. So I either see my selection constantly change (yay! Synchronization!!!) or I press confirm, and if my choice is gone I get a warning message and I repeat my query.<p>Now add two elements:
- airlines prefer not to show real numbers of available seats - they will usually send you a single digit from 1 to 9, or a "*" to mean "10 or more".<p>So just syncing raw data and letting the combinatorial engine work in the browser is not a very good idea.<p>Also, I see the potential to easily mount DDoS attacks if every client is constantly being synchronized by copying high-contention tables in real time.<p>What am I missing here?
The local-first people (<a href="https://localfirstweb.dev/" rel="nofollow">https://localfirstweb.dev/</a>) have some cool ideas about how to solve the data sync problem. Check it out.
The problem with sync engines is that they need full-stack buy-in in order to work properly. Having a separate backend-for-frontend service defeats the purpose in my mind. So what do you do when a company already has an API and other clients beyond a web app? The web app has to accommodate. I see this as the major downside of sync engines.<p>I've been using `starfx`, which is able to "sync" with APIs using structured concurrency: <a href="https://github.com/neurosnap/starfx" rel="nofollow">https://github.com/neurosnap/starfx</a>
I think an underappreciated library in this space is Logux [1]<p>It requires deeper (and more) integration work compared to solutions that sync your state for you, but is a lot more flexible wrt. the backend technology choices.<p>At its core, it is an action synchronizer. You manage both your local state and remote state through redux-style actions, and the library takes care of syncing and resequencing them (if needed) so that all clients converge at the same state.<p>[1] <a href="https://logux.org/" rel="nofollow">https://logux.org/</a>
The author would be excited to learn that CouchDB has been solving this problem for 20 years.<p>The use case the article describes is exactly the idea behind CouchDB: a database that is at the same time the server, and that's made to be synced with the client.<p>You can even put your frontend code into it and it will happily serve it (aka CouchApp).<p><a href="https://couchdb.apache.org" rel="nofollow">https://couchdb.apache.org</a>
Idk man. It's a nice idea, but it has to be 10x better than what we currently have to overcome the ecosystem advantages of the existing tech. In practice, people in the frontend world already use Apollo/Relay/TanStack Query to do data caching and querying, and don't worry too much about the occasional overfetching/unoptimized-ness of the setup. If they need to do a complex join they write a custom API endpoint for it. It works fine. Everyone here is very wary of a "magic data access layer" that will fix all of our problems. Serverless turned out to be a nightmare because it only partially solves the problem.<p>At the same time, I had a great time developing on Meteor.js a decade ago, which used Mongo on the backend and then synced the DB to the frontend for you. It was really fluid. So I look forward to things like this being tried. In the end though, Meteor is essentially dead today, and there's nothing to replace it. I'd be wary of depending so fully on something so important. Recently FaunaDB (a "serverless database") went bankrupt and is closing down after only a few years.<p>I see the product being sold is pitched as a "relational version of Firebase", which I think is a good idea. It's a good idea for starter projects/demos all the way up to medium-sized apps (and might even scale further than Firebase by being relational), but it's not "The Future" of all app development.<p>Also, I hate to be that guy, but the SQL in the example could be simpler: when aggregating into JSON it's nice to use a LATERAL join, which essentially turns the join into a for loop and synthesises rows "on demand":<p><pre><code> SELECT g.*,
COALESCE(t.todos, '[]'::json) as todos
FROM goals g
LEFT JOIN LATERAL (
SELECT json_agg(t.*) as todos
FROM todos t
WHERE t.goal_id = g.id
) t ON true
</code></pre>
That still proves the author's point that SQL is a very complicated tool, but I will say the query itself looks simpler (only 1 join vs 2 joins and a group by) if you know what you're doing.
Why he hasn't implemented a full Datomic Peer for his DataScript, I've never understood.<p>Having a Datalog query engine and supplying it with data from Datomic indexes (b-tree-like collections storing entity-attribute-value records) seems simple. Updating the local index cache from the log is also simple.<p>And that gets you a DB in the browser.
If anyone could be so kind as to give feedback on the local-first x data-ownership DB we're building, I would really appreciate it! <a href="https://docs.basic.tech/" rel="nofollow">https://docs.basic.tech/</a><p>I will do my best to take action on any feedback I receive here
We have had interest in using our serverless stream API (<a href="https://s2.dev/" rel="nofollow">https://s2.dev/</a>) to power sync engines. Very excited about these kinds of use cases, email in profile if anyone wants to chat.
I found it quite disappointing to find a marketing piece from Niki.<p>It is full of general statements that are only true for a subset of solutions.
Enterprise solutions in particular are vastly more complex and can't be magically made simple by a syncing database.
(no solution comes even close to "99% business code". Not unless you re-define what business code is)<p>It is astounding how many senior software engineers or architects don't understand that their stack contains multiple data models and even in a greenfield project you'll end up with 3 or more.
Reducing this to one is possible for simple cases - it won't scale up.
(Rama's attempt is interesting and I hope it proves me wrong)<p>From "yeah, now you don't need to think about the network too much" to "humbug, who even needs SQL":<p>I've seen much bigger projects fail because they fell for one or both of these ideas.<p>While I appreciate some magic on the front-end/back-end gap, being explicit (calling endpoints, receiving server-sent events) is much easier to reason about.
If we have calls failing, we know exactly where and why.
Sprinkle enough magic over this gap and you'll end up in debugging hell.<p>Make this a laser-focused library and I might still be interested, because it might remove actual boilerplate.
Turn it into a full-stack framework and your addressable market will be tiny.
> Such a library would be called a database.<p>Bold of them to assume a database can manage even the most trivial of conflicts.<p>There's a reason you bombard a "main/master/etc." node with all your writes.
The future of webapps: wasm in the browser, direct SQL for the API.<p>Main problem? No result caching but that's "just" a middleware to implement.
I recently took a part-time role at Oracle Labs and have been learning PL/SQL as part of a project. Seeing as Niki is shilling for his employer, perhaps it's OK for me to do the same here :) [1]. HN discourse could use a bit of a shakeup when it comes to databases anyway. This may be of only casual interest to most readers, but some HN readers work at places with Oracle licenses and others might be surprised to discover it can be cheaper than an AWS managed Postgres [2].<p>It has a couple of features relevant to this blog post.<p>The first: Niki points out that in standard SQL producing JSON documents from relational tables is awkward and the syntax is terrible. This is true, so there's a better syntax:<p><pre><code> CREATE JSON RELATIONAL DUALITY VIEW dept_w_employees_dv AS
SELECT JSON {'_id' : d.deptno,
'departmentName' : d.dname,
'location' : d.loc,
'employees' :
[ SELECT JSON {'employeeNumber' :e.empno,
'name' : e.ename}
FROM employee e
WHERE e.deptno = d.deptno ]
}
FROM department d WITH UPDATE INSERT DELETE;
</code></pre>
It makes compound JSON documents from data stored relationally. This has three advantages: (1) JSON documents get materialized on demand by the database instead of requiring frontend code to do it, (2) the ORDS proxy server can serve these over HTTP via generic authenticated endpoints (e.g. using OAuth or cookie based auth) so you may not need to write any code beyond SQL to get data to the browser, and (3) the JSON documents produced can be written to, not only read.<p>The second feature is query change notifications. You can issue a command on a connection that starts recording the queries issued on it and then get a callback or a message posted to an MQ when the results change (without polling). The message contains some info about what changed. So by wiring this up to a web socket, which is quite easy, the work of an hour or two in most web frameworks, you can stream changes to the client directly from the database without needing much logic or third party integrations. You either use the notification to trigger a full requery and send the entire result JSON back to the browser, or you can get fancier and transform the deltas to JSON subsets.<p>It'd be neat if there were a way to join these two features together out of the box, but AFAIK if you want full streaming of document deltas to the browser and reconstituting them there, it would need a bit more on top.<p>Again, you may feel this is irrelevant because doesn't every self-respecting HN reader use Postgres for everything, but it's worth knowing what's out there. Especially as the moment you decide to pay a cloud for hosting your DB you have crossed the Rubicon anyway (all the hosted DBs are proprietary forks of Postgres), so you might as well price out alternatives.<p>[1] and you know the drill, views are my own and nobody has reviewed this post.<p>[2] <a href="https://news.ycombinator.com/item?id=42855546">https://news.ycombinator.com/item?id=42855546</a>
> I’ve yet to see a code base that has maintained a separate in-memory index for data they are querying<p>Define "separate", but in my old X11 compositor project neocomp I did something like that with a series of AOS arrays and bitfields that combined to make a sort of entity manager. Each index in the arrays was an entity, and each array held the data associated with a "type" of entity. An entity could hold multiple types that would combine to specify behavior. The bitfield existed to make it quick to query.<p>It was waaay too complicated for what it was, but it was fun to code and worked well enough. I called it a "swiss" (because it was full of holes). It's still online on GitHub (<a href="https://github.com/DelusionalLogic/NeoComp/blob/master/src/swiss.h" rel="nofollow">https://github.com/DelusionalLogic/NeoComp/blob/master/src/s...</a>) even though I don't use it much anymore.
I've always wondered: how do applications with more stringent security requirements handle this?<p>Assume that permissions on any row in the DB can be removed at any time. If we store the data offline, this security measure is already violated. If you don't care about a user potentially storing data they no longer have access to, then when they come online, any operations they make are invalid and that's fine.<p>But if security access is part of your business logic, and is complex enough that it lives in your app and not in your DB (other than using DB tools like RLS), how do you verify that the user still has access to all cached data? Wouldn't you need to re-query every row every time?<p>I'm still uncertain how these sync engines can be secured properly
Sync, in general, is a very complex topic. There are past examples, such as just trying to sync contacts across different platforms where no definitive solution emerged. One fundamental challenge is that you can’t assume all endpoints behave fairly or consistently, so error propagation becomes a core issue to address.<p>Returning to the contacts example, Google Contacts attempts to mitigate error propagation by introducing a review stage, where users can decide how to handle duplicates (e.g., merge contacts that contain different information).<p>In the broader context of sync, this highlights the need for policies to handle situations where syncing is simply not possible beyond all the smart logic we may implement.
Didn't know that about Roam Research. I was a user, but that app also convinced me that front-end went in the wrong direction for a decade...<p>The Rocicorp Zero Sync / InstantDB / Linear-style trend is great -- sync will be big. I hope a lot of the SPA slop gets fixed!
TL;DR:<p>> If your database is smart enough and capable enough, why would you even need a server? Hosted database saves you from the horrors of hosting and lets your data flow freely to the frontend.<p>(this is the blog of one such hosted database provider)
I solved data sync in distributed apps a long time ago. I send outgoing data to /dev/null and receive incoming data from /dev/zero. This way data is always consistent. That also helps with availability and partition tolerance.