Wow, this is an oversimplification. I've had years of experience working in a data lake within a FAANG handling > 5 PB of data ingest per day. There are so many things this misses:<p>1. What if the domain teams don't actually care to maintain data quality, or even care about sharing data in the first place? This model requires every data producer to maintain a relationship with every data consumer. That's not gonna happen in a large company.<p>2. Who pays for query compute and data storage when you're dealing with petabytes and petabytes of data from different domains? If you (the data platform team) bill the domain teams, then see above: they'll just stop sending data.<p>3. Just figuring out what data exists in the data mart (which is essentially what this is describing) is a hassle and slows down business use cases, especially when you have 1000s of datasets. You need a team to act as a sort of "reference librarian" to help those querying the data. You can't easily decentralize this.<p>4. How do you get domain teams to produce data in a form that is easy to query? What if they write lots of small files that are computationally expensive to query; who's gonna advise them? Data production is closely tied to query performance at TB scale. The domain team is not gonna become experts, or care.<p>5. What do you do when a domain team has a lot of important data but no engineering resources? Do you just say "oh well, we're just a self-service data platform, so no one gets to access the data"?
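On point 4, the small-files problem is concrete enough to sketch. Here's a minimal, hypothetical stand-in (pure stdlib, using JSON-lines in place of a real Parquet/ORC compaction job) for the cleanup work the platform team ends up doing when a producer writes one tiny file per event:

```python
import json
import tempfile
from pathlib import Path

def compact_small_files(src_dir: Path, dst_file: Path) -> int:
    """Merge many small JSON-lines part files into one larger file.

    Stand-in for a real compaction job: query engines pay a per-file
    open/list/seek cost, so 10,000 x 1 KB files are far slower to scan
    than 10 x 1 MB files holding the same bytes.
    """
    rows = 0
    with dst_file.open("w") as out:
        for part in sorted(src_dir.glob("part-*.jsonl")):
            for line in part.open():
                out.write(line)
                rows += 1
    return rows

# Demo: a producer that writes one file per record (the anti-pattern)...
tmp = Path(tempfile.mkdtemp())
for i in range(100):
    (tmp / f"part-{i:05d}.jsonl").write_text(json.dumps({"id": i}) + "\n")

# ...and the compaction step someone has to own on their behalf.
merged = tmp / "compacted.jsonl"
n = compact_small_files(tmp, merged)
```

The point isn't this particular script; it's that somebody has to know this matters and run it, and the domain team usually won't.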
It really feels like data mesh is a fairly half-baked concept born out of short-term consulting gigs and a desire to become a technical thought leader.
Is there an underlying assumption here that all of the datasets' domains are perfectly in sync with each other when it comes to domain metadata?<p>As an example, Team1 might define the manufacturer of a Sprocket as the company that assembled it, whereas Team2 might define the manufacturer as the company that built the Sprocket's engine. Since the purpose of a data mesh is to enable other teams to perform cross-domain data analytics, these definitions need to be reconciled, or it'll become a datamess. Where does that get resolved?
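The Sprocket example can be made concrete. A toy sketch (all names hypothetical) of two domains publishing the same key and the same field name with different semantics, so a naive cross-domain join returns conflicting answers without anything in the schema flagging it:

```python
# Team1's dataset: "manufacturer" = the company that assembled the sprocket.
team1_sprockets = {"SPR-1": {"manufacturer": "AssembleCo"}}

# Team2's dataset: "manufacturer" = the company that built the sprocket's engine.
team2_sprockets = {"SPR-1": {"manufacturer": "EngineWorks"}}

def cross_domain_report(sprocket_id: str) -> dict:
    """Naive cross-domain join: same key, same column name, different meanings."""
    return {
        "team1_manufacturer": team1_sprockets[sprocket_id]["manufacturer"],
        "team2_manufacturer": team2_sprockets[sprocket_id]["manufacturer"],
    }

report = cross_domain_report("SPR-1")
# Both lookups succeed and both values are "valid" per their own domain,
# yet they disagree; only a shared semantic definition can catch this.
conflict = report["team1_manufacturer"] != report["team2_manufacturer"]
```

Schema compatibility checks pass here; the conflict is purely semantic, which is exactly why it needs an owner.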
It looks like a weird attempt to build a consulting business around a simple idea.<p>Treat data assets like microservices and pipelines like the network. Period.<p>Prescribing everything else rubs me the wrong way.<p>So, data mesh is: an architecture in which a company's data is organized into loosely coupled data assets.
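"Treat data assets like microservices" can be sketched as a published contract: each asset exposes an owner, a schema, and an SLA, and consumers depend on that contract rather than on the producer's pipeline internals. A hypothetical minimal version (all field names are assumptions, not part of any data mesh spec):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataAssetContract:
    """A loosely coupled data asset, analogous to a microservice's API contract."""
    name: str
    owner_team: str
    schema: dict              # column name -> type: the published interface
    freshness_sla_hours: int  # how stale the data is allowed to be

def is_compatible(contract: DataAssetContract, required_columns: dict) -> bool:
    """A consumer validates against the contract, not the pipeline internals."""
    return all(
        contract.schema.get(col) == typ for col, typ in required_columns.items()
    )

orders = DataAssetContract(
    name="orders_v1",
    owner_team="checkout",
    schema={"order_id": "string", "amount_usd": "double", "ts": "timestamp"},
    freshness_sla_hours=24,
)

# A downstream consumer checks only the columns it needs.
ok = is_compatible(orders, {"order_id": "string", "amount_usd": "double"})
```

Versioning the contract (`orders_v1`) is what lets the producer evolve the pipeline behind it, the same way a service versions its API.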
So if I understand this correctly, data mesh is just a data mart that doesn't bring data into a database as tables, but uses S3 storage instead (I assume because that's cheaper in the cloud?)
The concept of a data mesh is more of a business concept than a technical one. IMHO, the idea being proposed is that of a conceptual data server (not to be confused with a database server), much like an HTTP server or a mail server, where people can engage with data as a first-class citizen and create "data" products. This is especially true as we move from HTML to something like HDML (hyper data markup).<p>By making data the product (abstracting away all the gory details), you fundamentally engage with data through a UI or an API. As you expose these products they become accretive, while fundamentally encapsulating the domain expertise within them.
This seems like mostly common sense. Infrastructure teams should always be building tools that the org consumes (and ideally the general public)<p>In a lot of orgs this goes sideways and the infrastructure teams end up owning everything and never have time to do anything else. Usually this happens due to upper management putting on the squeeze.<p>In order for teams to actually own their infrastructure and data we need better tooling to help them. This is coming along nowadays but isn’t fully there.
Dunno about the merits of this, but it does seem to be part of the overall effort to rethink how to organize large groups of people working together. With the internet we can afford peer-to-peer communication, and we don't <i>have</i> to organize into hierarchies. But we can't just do full-mesh communication either, because that's overwhelming to individuals, as anyone who lived through the initial slack-and-zoom remote work of early 2020 can tell you. (Though lots of people are <i>still</i> living through it, unfortunately)<p>So what kind of communication structures are good, and in what circumstances? How do we structure work so that we don't have to communicate about <i>everything</i>? When do we fall back to ad-hoc video chat or even in-person meetings? These are the kinds of questions that 21st-century management has to answer. It's fascinating to watch people grapple with them.
Lots of concerns and scepticism in the discussions here. Any suggestions about good, achievable data strategies and data architecture that work at enterprise level?
It sounds almost entirely about team responsibility and governance, rather than technical architecture. What’s the difference from a data lake on a technical level?
Isn't this usually called a "data mart" as opposed to "data mesh"? Or is the "mesh" term intended to point to something more unstructured, like team- or business division-level equivalent to a data lake? But isn't that just a data pond?