A Data Pipeline Is a Materialized View

144 points by nchammas, about 4 years ago

10 comments

Great article, one quibble: there isn’t really a clear dividing line between batch and streaming. If you process data one row at a time, that is clearly a streaming pipeline, but most systems that call themselves streaming actually process data in small batches. From a user perspective, it’s an implementation detail; the only thing you care about is the latency target.

Nearly all data sources, including the changelogs of databases, are polling-based APIs, so you’re getting data from the source in (small) batches. If your goal is to put this data into a data warehouse like Snowflake, or a system like Materialize, the lowest-latency thing you can do is just immediately put that data into the destination. I sometimes see people put a message broker like Kafka in the middle of this process, thinking it’s going to imbue the system with some quality of streamyness, but this can only add latency. People are often surprised that we don’t use a message broker at Fivetran, but when you stop and think about it there’s just no benefit in this context.
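
To make the "polling in small batches" point concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `poll`, `sync`, and the in-memory `changelog` list are hypothetical stand-ins, not the API of any real database, warehouse, or of Fivetran. Each polled batch is written straight to the destination with no broker in between.

```python
from typing import Dict, List, Tuple

# (offset, key, value); a hypothetical changelog entry format
Change = Tuple[int, str, str]


def poll(changelog: List[Change], cursor: int, limit: int) -> List[Change]:
    """Hypothetical polling API: up to `limit` changes with offset > cursor.

    Assumes the changelog is ordered by offset.
    """
    return [change for change in changelog if change[0] > cursor][:limit]


def sync(changelog: List[Change], destination: Dict[str, str],
         cursor: int = 0, limit: int = 2) -> int:
    """Pull small batches and apply each one to the destination immediately."""
    while True:
        batch = poll(changelog, cursor, limit)
        if not batch:
            return cursor  # caught up; a real loop would sleep and poll again
        for _offset, key, value in batch:
            destination[key] = value  # upsert directly, no intermediate queue
        cursor = batch[-1][0]  # remember how far we have read


changelog = [(1, "a", "x"), (2, "b", "y"), (3, "a", "z")]
destination: Dict[str, str] = {}
final_cursor = sync(changelog, destination)
```

The batch size only affects how often the loop goes back to the source; the destination ends up in the same state either way.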

082349872349872, about 4 years ago

Sometime when I am old(er) and (somehow?) have more time, I'd like to jot down a "Rosetta Stone" of which buzzwords map to the same concepts. So often we change our vocabulary every decade without changing what we're really talking about.

"Things started out in a scholarly vein, but the rush of commerce hasn't allowed much time to think where we're going." -- James Thornton, Considerations in computer design (1963)

mjdrogalis, about 4 years ago

As someone who’s spent a lot of time working on data pipelines, I think this is a great breakdown of the complexity most data engineers are facing. However, I think there are two more keys to tidying up messy pipelines in practice:

1. You need to colocate both stream processing for the data pipeline and real-time materialized view serving for the results.
2. You need one paradigm for expressing both of these things.

Let me try to describe a bit why that is.

1. You virtually always need both stream processing and view serving in practice. In the real world, you ingest data streams from across the company and generally don’t have a say about how the data arrives. Before you can do the sort of materialization the author describes, you need to rearrange things a bit.
2. Building off of (1), if these two aren’t conceptually close, it becomes hard to make the whole system hang together. You still effectively have the same mess; it’s just spread over more components.

This is something we’re working really hard on solving at Confluent. We build ksqlDB (https://ksqldb.io/), an event streaming database over Kafka that:

1. Lets you write programs that do stream processing and real-time materialized views in one place.
2. Lets you write all of it in SQL. I see a lot of people on this post longing for bash scripting, and I get it. These frameworks are way too complicated today. But to me, SQL is the ideal medium. It’s both concise and deeply expressive. Way more people are competent with SQL, too.
3. Has built-in support for connecting to external systems. One other, more mundane part of the puzzle is just integrating with other systems. ksqlDB leverages the Kafka Connect ecosystem to plug into 120+ data systems.

You can read more about how the materialization piece works in a recent blog post I did: https://www.confluent.io/blog/how-real-time-materialized-views-work-with-ksqldb/

CapriciousCptl, about 4 years ago

As someone who basically sticks everything possible into Postgres, this is interesting! Streaming tools don't automatically cache things you need? I guess it's about time they do! Postgres, for instance, has a robust LRU mechanism that deals with OLTP quite competently. OLAP too, if your indices are thought out.

Also, although built-in materialized views don't allow partial updates in Postgres, you can get a similar thing with normal tables and triggers. Hashrocket discussed that strategy here: https://hashrocket.com/blog/posts/materialized-view-strategies-using-postgresql
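
A rough sketch of the "normal tables and triggers" idea, for anyone who wants to see its shape. It uses SQLite through Python's standard `sqlite3` module as a stand-in so that it runs anywhere; Postgres trigger syntax is different (trigger functions, typically written in PL/pgSQL), and the table and trigger names here are made up for illustration. This is not the code from the Hashrocket post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id       INTEGER PRIMARY KEY,
    customer TEXT    NOT NULL,
    amount   INTEGER NOT NULL
);

-- The "materialized view": an ordinary table that triggers keep up to date.
CREATE TABLE customer_totals (
    customer TEXT PRIMARY KEY,
    total    INTEGER NOT NULL
);

CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
BEGIN
    -- Make sure a row exists for the customer, then apply only the delta.
    INSERT OR IGNORE INTO customer_totals (customer, total)
    VALUES (NEW.customer, 0);
    UPDATE customer_totals
    SET total = total + NEW.amount
    WHERE customer = NEW.customer;
END;

CREATE TRIGGER orders_after_delete AFTER DELETE ON orders
BEGIN
    UPDATE customer_totals
    SET total = total - OLD.amount
    WHERE customer = OLD.customer;
END;
""")

conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 10), ("bob", 5), ("alice", 7)],
)
conn.execute("DELETE FROM orders WHERE customer = 'bob'")

# Read the view like any other table.
totals = dict(conn.execute("SELECT customer, total FROM customer_totals"))
```

Each write touches only the affected summary row, which is the partial update that a full refresh of a built-in materialized view does not give you. An UPDATE trigger is left out to keep the sketch short.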

snidane, about 4 years ago

Most problems of data engineering today would be solved by a tool in which I could define an arbitrary transformation of, say, a single daily data increment, and the system would handle the state management and loading of all of the increments, regardless of whether they came from source updates or backfills.

Data engineering really is just the maintenance of incrementally updated materialized views, but no tool out there yet recognizes it. They at best help you orchestrate and parallelize your ETLs across multiple threads and machines. They become glorified makefiles at the cost of introducing several layers of infrastructure into the picture (e.g. Airflow) for what should have been solved by simple bash scripting.

Yet at best these tools only help with stateless batch processing. When it comes down to stateful processing, which is necessary for maintaining incrementally updated materialized views and idempotent loads, I have to couple the logic of view state management (what has been loaded so far) with the logic of the actual data transformation.

The industry's response to the difficulties of batch ETL is usually: batch data processing systems are resource hungry and slow, all you need from now on is streaming.

No, actually I don't. For data analytics, pure streaming has almost no application. Data analytics is essentially compression of big data into something smaller, i.e. some form of group by. I have to wait for a window of data to close before computing anything useful. Analytics on real "real time" data on unclosed windows is confusing and useless.

So all data analytics will ever run on groups, windows and batches of data. Therefore I need a system which will help me run data transformations on batches. More precisely, on a stream of smaller batches. I need this to react to incoming daily, hourly or minutely batches, and I need this to backfill my materialized view in case I decide to wipe it and start again.

You can literally do this in what was supposed to be the original system for orchestrating a bunch of programs: shell scripting. And you'll be happier for it than using current complex frameworks. The only things you will miss are something to run distributed cron and something to distribute load to multiple machines. At least the latter can be handled by GNU parallel.

This article hits the nail on the head in describing what the conceptual model for ETL actually is, and once others follow, we might finally see new frameworks or just libraries that help us greatly simplify ETLs. Perhaps one day data engineering will be as simple as running an idempotent bash or Python or SQL script, or even close to nonexistent.
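
One way to read the tool described above, as a small sketch under stated assumptions: the user supplies a transformation for one increment, and a thin wrapper owns the state (which increments are loaded) so that reloads and backfills are idempotent. `IncrementalView` and `clicks_per_page` are hypothetical names invented for this example, not an existing library, and the state lives in memory rather than in a warehouse.

```python
from collections import Counter
from typing import Callable, Dict, List

Row = Dict[str, str]


class IncrementalView:
    """Keeps one transformed partition per increment key (e.g. a date)."""

    def __init__(self, transform: Callable[[List[Row]], Counter]) -> None:
        self.transform = transform
        self.partitions: Dict[str, Counter] = {}

    def load(self, increment_key: str, rows: List[Row]) -> None:
        # Replacing the whole partition (instead of appending) means that
        # loading the same increment twice leaves the view unchanged.
        self.partitions[increment_key] = self.transform(rows)

    def result(self) -> Counter:
        # Combine the per-increment aggregates into the full view.
        total: Counter = Counter()
        for partition in self.partitions.values():
            total.update(partition)
        return total


def clicks_per_page(rows: List[Row]) -> Counter:
    """User-defined transformation of a single increment: a group-by count."""
    return Counter(row["page"] for row in rows)


view = IncrementalView(clicks_per_page)
view.load("2021-02-20", [{"page": "/home"}, {"page": "/docs"}])
view.load("2021-02-21", [{"page": "/home"}])
# Re-running the first day (a retry or a backfill) does not double count.
view.load("2021-02-20", [{"page": "/home"}, {"page": "/docs"}])
page_counts = view.result()
```

The transformation never sees the state, and wiping the view and starting again is just replaying `load` over every increment.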

smknappy, about 4 years ago

Great post! Just heard about this from one of our customers who slacked me with "He is describing Ascend.io!" :-)

Having spent 15+ years writing big data pipelines and building teams who do the same, I couldn't agree more... the conceptual model we're all quite comfortable with is this notion of cascading, materialized views. The challenge, however, is that they are expensive to maintain in a big data pipeline context -- paid either in system resource cost, or developer cost. The only reasonable way to achieve this is a fundamental shift away from imperative pipelines, and to declarative orchestration (a few folks mention this as well). We've seen this in other domains with technologies like React, Terraform, Kubernetes, and more to great success.

I've written about this in tldr form @ https://www.ascend.io/data-engineering/, namely the evolution from ETL, to ELT, to (ETL)+, to Declarative. I also gave a more detailed tech talk on this topic @ https://www.youtube.com/watch?v=JcVTXC0qPwE.

For those who are interested in a longer white paper on data orchestration methodologies, namely imperative vs declarative, this is a good read: https://info.ascend.io/hubfs/Whitepapers/Whitepaper-Orchestration-Approaches.pdf

sasad, about 4 years ago

What are some of the limitations of dbt?

andyxor, about 4 years ago

Also see this talk by Martin Kleppmann on streams as materialized views: "Turning the database inside-out with Apache Samza" (2015): https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/

Previous discussion: https://news.ycombinator.com/item?id=9145197

xodast1, about 4 years ago

So each data pipeline is a pure function? Hmm, geez, if only we had something that was all about pure functions and how they can be used to express real-life problems.

gregw2, about 4 years ago

Whoever wrote this hasn't worked on medium-complicated data pipelines / ETL logic.

It's pretty non-trivial to try to make an effective-dated slowly changing dimension with materialized views.

A good tool makes the medium-difficulty stuff easy, and the complicated stuff possible. Materialized views do only the former.

I would love to be wrong about this.
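
For readers who haven't met the term, an effective-dated (type 2) slowly changing dimension keeps every historical version of a row together with the date range in which it was valid. The sketch below shows only the simplest case, with changes arriving in order; `DimRow` and `apply_change` are hypothetical names for illustration. Late-arriving and out-of-order changes, which are plausibly where the real difficulty lies, are not handled.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[DimRow], key: str, value: str,
                 effective: date) -> None:
    """Close the current version of `key` and open a new one.

    Assumes changes for a key arrive in increasing `effective` order.
    """
    current = next(
        (row for row in history if row.key == key and row.valid_to is None),
        None,
    )
    if current is not None:
        if current.value == value:
            return  # nothing changed, keep the open version as it is
        current.valid_to = effective  # close the old version
    history.append(DimRow(key, value, effective))


history: List[DimRow] = []
apply_change(history, "customer-1", "Berlin", date(2021, 1, 1))
apply_change(history, "customer-1", "Berlin", date(2021, 1, 15))  # no-op
apply_change(history, "customer-1", "Paris", date(2021, 2, 1))
```

Note that the update step has to read and modify the previous output (closing the old row), so the result depends on the order in which changes are applied and not only on the current contents of the source table.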
