I didn't really understand what the product actually did after reading this blog post or the products page. I found the docs much more edifying:

> Materialize lets you ask questions about your data, and then get the answers in real time.

> Why not just use your database’s built-in functionality to perform these same computations? Because your database often acts as if it’s never been asked that question before, which means it can take a long time to come up with an answer, each and every time you pose the query.

> Materialize instead keeps the results of the queries and incrementally updates them as new data comes in. So, rather than recalculating the answer each time it’s asked, Materialize continually updates the answer and gives you the answer’s current state from memory.

> Importantly, Materialize supports incrementally updating a much broader set of views than is common in traditional databases (e.g. views over multi-way joins with complex aggregations), and can do incremental updates in the presence of arbitrary inserts, updates, and deletes in the input streams.

https://materialize.io/docs/
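To make that concrete, the basic workflow looks roughly like this (a sketch pieced together from the docs; the broker, topic, and schema registry addresses are invented):

    -- ingest a Kafka topic as a source (hypothetical broker/topic)
    CREATE SOURCE orders
    FROM KAFKA BROKER 'localhost:9092' TOPIC 'orders'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://localhost:8081';

    -- the view is computed once, then incrementally maintained
    -- as inserts/updates/deletes arrive on the topic
    CREATE MATERIALIZED VIEW revenue_by_region AS
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region;

    -- reads return the view's current state from memory,
    -- rather than recomputing the query from scratch
    SELECT * FROM revenue_by_region;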
> We believe that streaming architectures are the only ones that can produce this ideal data infrastructure.

I just want to say this is a very dangerous assumption to make.

I run a company that helps our customers consolidate and transform data from virtually anywhere in their data warehouses. When we first started, the engineer in me made the same declaration, and I worked to get data into warehouses seconds after an event or record was generated in an origin system (website, app, database, Salesforce, etc).

What I quickly learned was that analysts and data scientists simply didn't want or need this. Refreshing the data every five minutes in batches was more than sufficient.

Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules onto that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data into the warehouse with sub-minute latency, the jobs to transform that data ran every 5 minutes.

To be clear, I don't discount the need for telemetry and _some_ data to be actionable in a smaller time frame; I'm just wary of a data warehouse fulfilling that obligation.

In any event, I do think this direction is the future (an overwhelming number of data sources allow change data capture almost immediately after an event occurs), I just don't think it's the only architecture that can satisfy most analysts'/data scientists' needs today.

I would love to hear the use cases that your customers have that made Materialize a good fit!
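For context, the sort of batch transform I mean looks something like this, run on a 5-minute schedule (purely illustrative; the table and column names are made up):

    -- rebuild a clean, deduped aggregate from the raw landing table
    CREATE TABLE daily_revenue_new AS
    SELECT order_date, region, SUM(amount) AS revenue
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY order_id
                                ORDER BY updated_at DESC) AS rn
      FROM raw_orders
    ) deduped
    WHERE rn = 1          -- keep only the latest version of each order
    GROUP BY order_date, region;

    -- swap the fresh result in for the old one
    DROP TABLE IF EXISTS daily_revenue;
    ALTER TABLE daily_revenue_new RENAME TO daily_revenue;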
Would it be fair to say this is a more OLAP-oriented approach to what ksqlDB (not KSQL, but https://ksqldb.io/) does?

It seems to perhaps lack the richness of how ksqlDB uses Kafka Connectors (sinks and sources), but I don't see any reason you couldn't use Materialize in conjunction with ksqlDB. E.g.:

    KC-source --> ksql --> materialize --> kafka --> KC-sink

Questions for Materialize...

What connectors (sinks and sources) do you have or plan to develop? It seems like it's mostly Kafka in and out at the moment.

Why would I use this over ksqlDB?

Can I snapshot and resume from the stream? Or do I need to rehydrate to re-establish state?
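For what it's worth, the Kafka-in/Kafka-out legs of that diagram seem to look like this in Materialize (a sketch from skimming the docs; broker, topic, and registry addresses are placeholders):

    -- ingest a topic, possibly one that ksqlDB produced
    CREATE SOURCE ksql_output
    FROM KAFKA BROKER 'broker:9092' TOPIC 'ksql-output'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://registry:8081';

    -- an incrementally maintained view over that stream
    CREATE MATERIALIZED VIEW order_totals AS
    SELECT customer_id, COUNT(*) AS order_count
    FROM ksql_output
    GROUP BY customer_id;

    -- emit the view's changes back out to Kafka for a KC-sink
    CREATE SINK order_totals_sink
    FROM order_totals
    INTO KAFKA BROKER 'broker:9092' TOPIC 'order-totals'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://registry:8081';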
I really like the pg protocol (like e.g. Cockroach); it lets me use my usual tools. There are a few things I noticed:

1. It has fairly rich support for types (a quick smoke test is sketched after this list) - these new-ish SQL engines often lack quite a lot of things, but this seems pretty decent.
2. I don't see any comparisons to KSQL, which seems to be the primary competitor.
3. Read the license. Read it carefully. It has a weird "will become open source in four years" clause, so keep that in mind. It also disallows hosting it for clients to use (essentially as a DBaaS).
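On the types point (1. above), this is the kind of trivial smoke test I mean, run over psql (I'm assuming jsonb support here based on the docs; not exhaustive by any means):

    -- jsonb literal and operator, straight over the pg wire protocol
    SELECT ('{"a": [1, 2, 3]}'::jsonb -> 'a') AS json_field;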
(Linked in the post, but) GitHub repo: https://github.com/MaterializeInc/materialize
For anyone interested in the details behind all of this, you should check out Frank's blog:

https://github.com/frankmcsherry/blog
For anyone who might be considering trying something similar with their own Postgres database (PG10+), we recently open-sourced this: https://github.com/supabase/realtime

It's an Elixir (Phoenix) server that listens to PostgreSQL's native replication, transforms it into JSON, then blasts it over websockets.

I see that Materialize is using Debezium, which will give you a similar result, just with connectors to Kafka etc.
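If you want to poke at the underlying mechanism yourself, the replication stream it listens to is Postgres logical decoding, roughly like this (using the built-in test_decoding plugin for illustration; the server manages its slot differently, so treat this as a sketch):

    -- requires wal_level = logical in postgresql.conf (restart needed)
    ALTER SYSTEM SET wal_level = logical;

    -- create a logical replication slot to stream changes from
    SELECT * FROM pg_create_logical_replication_slot(
      'realtime_slot', 'test_decoding');

    -- peek at the decoded inserts/updates/deletes on the slot
    SELECT * FROM pg_logical_slot_peek_changes('realtime_slot', NULL, NULL);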
I am curious about the physical storage. Is it purely in-memory, or is disk persistence possible? Is any kind of data compression applied, and what are the memory requirements? Is the data layout row-based or column-based?
The "you may not cluster any server instances of the Licensed Work together for one use" in the license is a fairly tricky clause. Under this clause, how would one run a fault-tolerant instance of Materialize?
How does materialize compare in performance (especially ingress/egress latency) to other OLAP systems like Druid or ClickHouse? Would love to see some benchmarks.
> Blazing fast results

I highly doubt this, given that the query engine is interpreted and non-vectorized. Without compilation or vectorization, simple queries are 10x to 100x slower, and queries with large aggregations and joins are 100x to 1000x slower.

> Full SQL Exploration

Except for window functions, it seems. These actually matter to data analysts.
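To illustrate what's missing: this is the kind of query analysts reach for constantly (a made-up example using ROW_NUMBER, one of the standard window functions):

    -- rank each customer's orders by recency
    SELECT customer_id, order_id,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY created_at DESC) AS recency_rank
    FROM orders;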
Pretty cool tech, although I feel they may have missed the moment, as AWS, Azure and GCP are becoming hypercompetitive in solving all things related to data/storage. Azure has been churning out major updates to its services and clearly taking inspiration from companies like Snowflake. AWS, I think, hesitated to compete with Snowflake as they were running on AWS anyway - win/win for them.

Snowflake had incredible timing, as they hit the market just before CFOs and non-tech business leaders realized the cost and talent needed to pull off a data lake successfully was more than they'd like. Those who were sick of the management jumped to Snowflake fast, and AWS/Azure never really responded until recently.

Awesome to see all the innovative takes on solving these extremely technical problems! I love it!
Congrats on the launch, always nice to see new products.

This is an interesting mix between the (now obsolete) PipelineDB, TimescaleDB with continuous aggregates, Kafka and other message systems with KSQL/ksqlDB/KarelDB, stream processing engines like Spark, and typical RDBMSs like SQL Server with materialized views.

The amount of research to support complex and layered queries definitely sets this apart.
Not sure how the feature sets compare, but AWS is releasing materialized views for Redshift sometime soon, and one of the things it will support is incremental refresh (assuming your view meets some criteria).

I'm sure Materialize is better at this since it's purpose-built, but if you're on Redshift you can get at least some of the benefits of incremental materialization.
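On the Redshift side, that would presumably look something like this (sketched from the preview docs; whether a given view refreshes incrementally or gets fully recomputed depends on the criteria AWS publishes):

    -- define the view once
    CREATE MATERIALIZED VIEW daily_sales AS
    SELECT order_date, SUM(amount) AS total
    FROM sales
    GROUP BY order_date;

    -- refresh on your own schedule; eligible views refresh
    -- incrementally, others are recomputed from scratch
    REFRESH MATERIALIZED VIEW daily_sales;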
Materialize connects directly to event stream processors (like Kafka) --- how about Pulsar? (Googling doesn't yield anything useful; Materialize and Pulsar are both names shared by other brands.)
I'm wondering how this technology could work for OLAP cubes.

An OLAP cube that is automatically & incrementally kept in sync with the changes in the source data sounds promising.

Is that a potential use case?
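A sketch of what I have in mind, if GROUP BY views can stand in for cube rollups (invented schema, and I'm assuming each dimension combination would just be its own view):

    -- one incrementally maintained rollup per dimension combination
    CREATE MATERIALIZED VIEW sales_by_region_product AS
    SELECT region, product, SUM(amount) AS total, COUNT(*) AS n
    FROM sales
    GROUP BY region, product;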
Looking back at the project, knowing what you know now, if you were to start again (but without the Rust skills you've since acquired), would you go with Rust again or pick another toolbox?
See also https://news.ycombinator.com/item?id=22346915