EventReduce: An algorithm to optimize database queries that run multiple times

208 点作者 eventreduce大约 5 年前

19 条评论

danbruc大约 5 年前

Did I understand this correctly? You have a single set of items which you query by evaluating a predicate on each of them and then sort the matching ones. After the initial query you update the query result by looking at all the data update events, i.e. you remove delete items from the result, you insert matching new items in the correct position according to the sort order and you insert, remove, or move updated items as they start or stop matching the predicate and change their position according to the sort order.

评论 #22888713 未加载

pgt大约 5 年前

Materialize exists to efficiently solve the view maintenance problem: <a href="https://materialize.io/" rel="nofollow">https://materialize.io/</a>

评论 #22888579 未加载

评论 #22889230 未加载

评论 #22927257 未加载

评论 #22888431 未加载

cryptonector大约 5 年前

How does this work for complex queries with sub-queries, LEFT OUTER JOINs, LATERAL JOINs, aggregation (DISTINCT/GROUP BY), window functions, CTEs, RECURSIVE CTEs?I've a radically different approach: can the queries in question as VIEWs, materialize them, use triggers to update materializations where you can write those triggers easily and the updates are quick, or schedule an update where they're not.If your RDBMS is very good about pushing WHERE constraints into VIEWs, and depending on how complex a VIEW query is, you might be able to make the update automatic by just querying the materialization's underlying VIEW with appropriate WHERE constraints from the ROWs being INSERTed/UPDATEd/DELETEd. You can tell which VIEWs might suitable for this by checking that the TABLE whose row the trigger is running for is a "top-level" table source for the VIEW's query: meaning a table source that's either the left side of a top-level LEFT JOIN, or either side of an INNER JOIN. If you can run a query on the VIEW with a timeout then you can just do that in the trigger and mark the materialization as needing an update if the query is too slow. Lastly, a scheduled or NOTIFYed job can run to perform any slower updates to a materialization.

评论 #22897081 未加载

throwaway_pdp09大约 5 年前

Forgive me if I missed stuff, please point me in the right direction if you've covered it, but some questions if I may (after I've said well done!). I've considered this problem before and it seems very difficult . So:1. Do you have a paper on this with a rigorous justification of the algorithm?2. This surely has to rely on the isolation level being pretty high, or EventReduce might be reading while n other processes are updating. I don't see that mentioned.3. Surely you need logical clocks for this? If not, could you point me to a high-level description of the algorithm to show why they aren't necessary.4. Why does sort order matter? A timestamp yes (see 3. above), but I don't understand why the order matters.thanks (and trying to understand this might be the thing to get me into looking at BDDs again. I never understood their value).

评论 #22888460 未加载

telekid大约 5 年前

This seems conceptually similar to differential dataflow.<a href="https://github.com/timelydataflow/differential-dataflow/blob/master/differentialdataflow.pdf" rel="nofollow">https://github.com/timelydataflow/differential-dataflow/blob...</a>

cntlzw大约 5 年前

"EventReduce can be used with relational databases but not on relational queries that run over multiple tables/collections."Forgive my ignorance, but that is the whole point of working with a relational database. If cannot use JOINS then this solves only a very limited use case.

评论 #22890312 未加载

评论 #22890178 未加载

joefreeman大约 5 年前

This sounds like (a simpler version of?) Lambda Architecture [1, 2][1] <a href="https://en.wikipedia.org/wiki/Lambda_architecture" rel="nofollow">https://en.wikipedia.org/wiki/Lambda_architecture</a> [2] <a href="https://www.manning.com/books/big-data" rel="nofollow">https://www.manning.com/books/big-data</a>

评论 #22888676 未加载

评论 #22888535 未加载

Zaheer大约 5 年前

The goal of this is to reduce DB queries? Why not just queue up / batch writes? What benefits does this provide over application side batching of the event.EvenReduce assumes there are no other systems interacting with the DB state (by using the old state that the current system saw). If there are no other systems, simple batching would work fine.

评论 #22893739 未加载

pfarrell大约 5 年前

I’m going to look at the code, but how does are transactions in the database handled in eventreduce?Specifically, I’m wondering about isolation levels which determine whether uncommitted changes are queryable before commit/rollback.

truth_seeker大约 5 年前

IMO An open cursor of Change Stream with Aggregation pipeline (for given use-case) in MongoDB is more flexible solution to achieve this functionality.In addition, it also tracks the history of changes and hence allows the cursor to go back if needed with "resumeToken"<a href="https://docs.mongodb.com/manual/changeStreams/" rel="nofollow">https://docs.mongodb.com/manual/changeStreams/</a>

评论 #22889918 未加载

marceloabsousa大约 5 年前

Very cool! This reminds me of some research I did a few years ago on program consolidation: <a href="https://dl.acm.org/doi/10.1145/2594291.2594305" rel="nofollow">https://dl.acm.org/doi/10.1145/2594291.2594305</a>

评论 #22890323 未加载

LunaSea大约 5 年前

Databases like PostgreSQL don't offer insights into the query plans, does EventReduce parse the SQL statements to determine which tables and rows will be affected by a query and run the appropriate caching or cache invalidation logic?

评论 #22889877 未加载

评论 #22889979 未加载

评论 #22889682 未加载

评论 #22889535 未加载

AmericanBlarney大约 5 年前

Many applications solve this by using memory caching (e.g. Redid, memcached, etc.) of performance sensitive datasets. There are a lot of drawbacks to the approach, to the point that I would avoid it altogether.

评论 #22893779 未加载

pfarrell大约 5 年前

Have you compared with an in-memory data store like Redis? Due to the lack of support for joins, that seems like a more natural comparison than A relational database.

gitgud大约 5 年前

Interesting, seems similar to Firebase's Firestore NoSQL database. In that you can create a complex query and receive real-time updates on each query.

regularfry大约 5 年前

So... a write-through cache?

alecco大约 5 年前

The use case seems better fit for a streaming database.

评论 #22888728 未加载

dahauns大约 5 年前

Sooo...Materialized/Indexed Views?

评论 #22888467 未加载

评论 #22888474 未加载

Asraful56大约 5 年前

Nice

19 条评论

danbruc大约 5 年前

评论 #22888713 未加载

pgt大约 5 年前

Materialize exists to efficiently solve the view maintenance problem: <a href="https://materialize.io/" rel="nofollow">https://materialize.io/</a>

评论 #22888579 未加载

评论 #22889230 未加载

评论 #22927257 未加载

评论 #22888431 未加载

cryptonector大约 5 年前

评论 #22897081 未加载

throwaway_pdp09大约 5 年前

评论 #22888460 未加载

telekid大约 5 年前

cntlzw大约 5 年前

评论 #22890312 未加载

评论 #22890178 未加载

joefreeman大约 5 年前

评论 #22888676 未加载

评论 #22888535 未加载

Zaheer大约 5 年前

评论 #22893739 未加载

pfarrell大约 5 年前

truth_seeker大约 5 年前

评论 #22889918 未加载

marceloabsousa大约 5 年前

评论 #22890323 未加载

LunaSea大约 5 年前

评论 #22889877 未加载

评论 #22889979 未加载

评论 #22889682 未加载

评论 #22889535 未加载

AmericanBlarney大约 5 年前

评论 #22893779 未加载

pfarrell大约 5 年前

Have you compared with an in-memory data store like Redis? Due to the lack of support for joins, that seems like a more natural comparison than A relational database.

gitgud大约 5 年前

Interesting, seems similar to Firebase's Firestore NoSQL database. In that you can create a complex query and receive real-time updates on each query.

regularfry大约 5 年前

So... a write-through cache?

alecco大约 5 年前

The use case seems better fit for a streaming database.

评论 #22888728 未加载

dahauns大约 5 年前

Sooo...Materialized/Indexed Views?

评论 #22888467 未加载

评论 #22888474 未加载

Asraful56大约 5 年前

Nice