Why isn't differential dataflow more popular?

228 points by jamii, over 4 years ago

36 comments

roenxi, over 4 years ago
Indirectly answering the question - I've skimmed through the git README [0], the abstract, and all the pictures in the academic paper that it references.

I have no idea what this thing does. Can someone explain in simple terms what it does?

My organisation is currently investigating installing Spark, on the theory that it connects to databases and we need analytics. As far as I can tell it breaks analytics work into parallel workloads.

[0] https://github.com/TimelyDataflow/differential-dataflow/
nemothekid, over 4 years ago
I've always been interested in distributed stream processing platforms (Storm, Spark, Samza, Flink, etc.) - and I've been interested in a distributed processing platform that wasn't on the JVM (there used to be one called Concord). That said, I came across differential dataflow a while ago (as I also began writing more and more Rust).

I think the biggest issue is the documentation - not so much on writing code, but on building an actual production service with it. I think most of us can now grok that you have a Kafka stream on one end and a datastore on the other, and the quintessential map/reduce hello world is WordCount.java. That isn't clear from the differential dataflow documentation - I remember thinking *how are they getting data from the outside world into this thing*, then thinking *maybe I don't understand this project at all*.

Consider the example in the README - the hello world is "counting degrees in a graph". While it gives you an idea of how simple it is to express that computation, it isn't interactive - it's unclear how one might change the input parameters (or if that's even possible). The hardest part of most of these frameworks is glue - but once you have that running, exploring what's possible is much easier. Differential dataflow doesn't provide that for me right off the bat.

That said, I'm not surprised; when I last checked it out, Rust Kafka drivers weren't all there and it seemed to be evolving parallel to everything else. I think what would make it more popular is a mental translation of common Spark tasks (like WordCount) to differential dataflow.
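For what it's worth, here is a minimal sketch of what an interactive variant of that degree-counting example might look like, assuming the InputSession / Collection API from the repository (this is not the README's own code): edges are inserted and later retracted across epochs, and only the affected counts are re-derived.

```rust
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::Count;

fn main() {
    timely::execute_from_args(std::env::args(), move |worker| {
        // Handle through which data enters the dataflow from "the outside world".
        let mut edges = InputSession::<usize, (u32, u32), isize>::new();

        worker.dataflow(|scope| {
            edges
                .to_collection(scope)
                .map(|(src, _dst)| src)   // keep the source of each edge
                .count()                  // (node, out-degree)
                .inspect(|change| println!("degree change: {:?}", change));
        });

        // Epoch 0: an initial batch of edges.
        for node in 0..5u32 {
            edges.insert((node, node + 1));
        }
        edges.advance_to(1);
        edges.flush();

        // Epoch 1: change the input; counts are updated, not recomputed.
        edges.remove((0, 1));
        edges.insert((0, 2));
        edges.advance_to(2);
        edges.flush();
        // When this closure returns, the worker drains the dataflow and the
        // inspect callback prints the per-epoch changes.
    })
    .expect("timely computation failed");
}
```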
theamk, over 4 years ago
I don't think it is possible to compare "differential dataflow" with projects like Spark (I don't know about Kafka Streams).

Spark is a "product" -- it has extensive documentation, supports multiple languages, and generally is production-grade. It is full of "nice-to-haves", like interactive status monitors, helper functions, serialization layers, and so on. It integrates with existing technologies (Hadoop, K8s, HDFS, etc.). It has this "finished product" feel.

"Differential dataflow" seems to be a library. It only supports a single language. The documentation is very basic (it was not even clear whether there is a way to run this on multiple machines or not). It is very bare-bones -- there are only a few dozen functions, and no resource monitors or interactive shells. It does not seem to integrate with anything. It has a "research software" feel -- there are random directories in the top-level repo, academic papers, and so on.

(It would probably be fairer to compare *Materialize* to Spark...)
lumost, over 4 years ago
My 2 cents.

Compute and storage capabilities keep growing rapidly. If one structures their data well, uses a reasonable query processor, and some form of underlying columnar storage, then computing calculations over TBs of data can be accomplished in seconds at low cost.

Being able to recompute the world from scratch is a P0 requirement for most analytic workloads, as otherwise migrations and unforeseen computation changes due to new product requirements and other activities become painful.

This leaves differential techniques in an awkward spot where, to be effective, they need to:

1) Operate on vast quantities of data, or on sufficiently complex calculations, such that optimization of compute is a concern to the end user.

2) Operate in a computational environment that is sufficiently constrained that all present and future changes to the computation can be reasonably accounted for.

3) Be transparent enough that engineers don't feel that they are duplicating logic.

It's hard to think of applications that would meet these criteria outside of intelligent caching of computation within DB engines.
michannne, over 4 years ago
> It's missing some important feature, like persistence?

Hard to answer, considering....

> It's had very little advertising?

Personally, I've never heard of it and have no idea what differential dataflow is. Maybe I've done it but never gave or discovered a name for it. I don't know what Spark or Kafka Streams are. Maybe because I've never had a use case for those that wasn't satisfied by a tool that was "good enough", or, more likely, I haven't come across anyone recommending those on projects, because they also don't know what those tools are. I would never have known what RabbitMQ was if a coworker hadn't suggested we use it to build queues, and it turned out to be cumbersome to use and 100x more complicated than writing a stored procedure that turned out to be "good enough". Most tools fall into that space where they are marginally better in some regards than "good enough", but not better enough to make up for the learning curve for other developers, changes in maintenance or design, cost, etc. "Advertising" is pretty general and it's hard to say which of these things it's doing wrong; and depending on their market, none of these might be wrong for them - the potential market is just content with "good enough" and has no need to search for tools like this.

> Rust is intimidating?

I'm not sure what the stats on Rust are, but I don't think it's popular enough with business developers that you could point to it as the reason a tool has failed in the adoption phase.
carapace, over 4 years ago
"Build a better mousetrap and the world will beat a path to your door" is bunk. People don't automatically adopt new, better things. I don't know why, though.

In my teens and twenties I collected ideas the way some folk collect stamps. The simple fact of the matter is that there are amazing things out there that you've never heard of, and no one really seems to care.

(As an aside, I *hate* the question, "If FOO is so great why doesn't everyone use it?" I do not know. That's not my department.)

These days some of these things are better known and some even have Wikipedia pages and stuff (e.g. https://en.wikipedia.org/wiki/Vaneless_ion_wind_generator) but a lot of others are still obscure (trawl through Rex Research if you want to look for weird tech).

Like, there's a mechanism that can absorb kinetic energy. The demo has a little car on rails with a ramp at one end and a wall at the other. They put a wineglass at the wall and they put the car on the ramp and let it go: the glass shatters. They activate the device and repeat: the car hits the glass and halts, and the glass does not shatter. Messed up, right? They're from Poland IIRC, and they've been doing demos at trade shows. I bet you've never heard of them. (Bug me and I'll try to dig up a link; they're in Rex Research.)

I already mentioned the "vaneless" ion wind generator, an efficient solid-state device for converting wind into electric power without, e.g., killing birds with spinning vanes. Cheap, simple, easy, durable, been around for decades, and you just now heard about it, eh? :)

There's a battery that desalinizes salt water. A nuclear reactor made of molten salt. Balloons stronger than steel. There's a guy in Michigan, Wally Wallington, who figured out how to move monoliths single-handedly - *they walk*, just like the old stories say!

Anyway, I'm getting ranty here. To veer back on topic: yeah, it's a bummer - you build an awesome mousetrap and even the people with lots of mice ignore it. I wish I knew what to tell you. Maybe paint it mauve?
liminal, over 4 years ago
I've looked at this and thought it looked amazing, but also haven't used it for anything. Some thoughts...

Rust is a blessing and a curse. It seems like the obvious choice for data pipelines, but everything big currently exists in Java and the small stuff is in JavaScript, Python or R. Maybe this will slowly change, but it's a big ship to turn. I'm hopeful that tools like this and Ballista [1] will eventually get things moving.

Since the Rust community is relatively small, language bindings would be very helpful. Being able to configure pipelines from Java or TypeScript(!) would be great.

Or maybe it's just that this form of computation is too foreign. By the time you need it, the project is so large that it's too late to redesign it around it. I'm also unclear on how it would handle changing requirements and recomputing new aggregations over old data. Better docs with more convincing examples would be helpful here. The GitHub page showing counting isn't very compelling.

[1] https://github.com/ballista-compute/ballista
mywittyname, over 4 years ago
These products are competing for mindshare in an incredibly saturated market. There are a lot of ways to skin the data-pipeline cat. I think a lot of companies have already founded data engineering teams, all of whom have established tech stacks for data engineering tasks.

Personally, I keep an eye out for new technologies, but I'm not likely to embrace them without good reason. A fragmented tech stack is annoying.

This looks an awful lot like Spark to me. And it doesn't seem to really solve the problems *I* typically experience with data engineering. For me, the biggest issue is orchestration. I don't see any facilities here for managing and executing data pipelines.

So, it seems to me that people aren't using differential dataflow more because it looks a lot like legacy products on the market, and it doesn't solve the massive problem of job orchestration and management. Apache Airflow + Python + BigQuery is immensely powerful and dead simple to use. It's going to be hard to compete with.
akiselev, over 4 years ago
The problem is these two:

*The api is too hard to use?*

*The docs / tutorials are not good enough?*

DD falls into an uncanny valley where the API surface is simple enough to grasp quickly, yet foreign enough that actually grokking it is pretty hard, let alone applying it in an organization where maintenance is a top concern. To do anything nontrivial, you need knowledge of timely-dataflow too, and the DD documentation doesn't do a good job of integrating knowledge from the TD docs - they're written by someone who has already internalized that knowledge, so it's an afterthought. Getting data in and out of the dataflow and orchestrating workers is pretty much undocumented outside of GitHub issue discussions. Trying to abstract away dataflows behind types and functions turns into a big ol' generic mess. There are a lot of rough edges like that (and the abomonation crate is... well... an abomination).

McSherry's blog posts, while tantalizing, are often focused too much on static examples (the entire dataset is available upfront) and are too academically focused to make up for holes in the book. As far as I can tell, the library hasn't seen enough use for best practices to emerge, and there's almost no guidance on how to build a real-world system with DD.

By far the biggest problem I've had: I can avoid a DD project for a week or two at most before enough knowledge leaves my memory that I have to spend days rereading my own code to get reoriented and productive again. You either use unlabeled tuples, which turns the dataflow into an unholy mess, or you spend half your time writing and deleting boilerplate when doing R&D. DD is just too weird and the API too awkward - I haven't figured out a method for writing straightforward DD code.

That said, when I have gotten it to work on nontrivial problems, the performance and capabilities have been really impressive. I've just never been able to get the stars to align to use it in a professional context with future maintainers.

I think what DD needs is a LINQ-like composable query language that abstracts away the tuple datatypes and provides an ORM/query-builder layer on top of dataflow statements. Most developers are familiar with SQL statements, which would make DD a lot easier to adopt.
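To make the "unlabeled tuples" point concrete, a hypothetical fragment (the collections, record layouts, and helper function are invented for illustration, not taken from DD's docs): after a couple of joins, every closure destructures a deepening nest of anonymous tuples by position.

```rust
use differential_dataflow::lattice::Lattice;
use differential_dataflow::operators::Join;
use differential_dataflow::Collection;
use timely::dataflow::Scope;

// Hypothetical pipeline: ids and amounts are bare integers, so every join
// deepens the anonymous-tuple nesting and each closure destructures it
// positionally -- readable today, opaque after two weeks away from the code.
fn total_by_region<G>(
    orders: &Collection<G, (u64, (u64, i64))>, // (user_id, (order_id, amount))
    users: &Collection<G, (u64, u64)>,         // (user_id, region_id)
    regions: &Collection<G, (u64, String)>,    // (region_id, region_name)
) -> Collection<G, (String, i64)>
where
    G: Scope,
    G::Timestamp: Lattice,
{
    orders
        .join(users)   // (user_id, ((order_id, amount), region_id))
        .map(|(_user, ((_order, amount), region))| (region, amount))
        .join(regions) // (region_id, (amount, region_name))
        .map(|(_region, (amount, name))| (name, amount))
}
```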
Grimm1, over 4 years ago
Personally, after now seeing this, I think it's going to solve a problem we're going to run into in the medium term, so that's pretty neat - we are dealing with a lot of data, and re-computing certain things from scratch would be potentially prohibitive at our scale of data.

I think the main issue is that in most shops the scale of the data isn't so large that re-computing a query with new data takes long enough that they would want to put in the engineering effort to switch off more common tools like Spark, Airflow and columnar-storage DBs. They're also likely, with decent engineering, not yet at a point where they run into tuning issues on their ingest side. An ETL taking an hour every night and then a couple of seconds to run that query - or even having that query set up on a job that just sends out a report - isn't really an issue for most small-to-medium-sized companies, and even at larger ones, if your data throughput isn't particularly high, I don't see people needing to reach for this for the same reasons.

You obviously can do those less intensive tasks in DDF, but it doesn't really make a strong case for itself in those regards, largely because DDF doesn't seem to offer any more benefit on those smaller tasks. 15s to 230ms is a really tremendous leap in performance, but for many companies I doubt the 15s is a bottleneck in the first place, so it's not actually solving a problem there; it would be a nice-to-have.
thelastbender12, over 4 years ago
A possible reason not mentioned in the post is that writing efficient incremental algorithms is just fundamentally hard, despite the primitives and tooling afforded by the differential dataflow library. For example, even with a lot of machine learning libraries targeting Python, there are only a couple that really implement online algorithms.
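As a concrete, library-agnostic illustration of what "online" means here - a minimal sketch using only the standard library: Welford's algorithm maintains a running mean and variance with O(1) work per new sample instead of rescanning the data. Most statistics are much harder to incrementalize than this.

```rust
/// Minimal sketch of an online (incremental) statistic: Welford's algorithm.
/// Each new sample updates the summary in O(1) instead of rescanning the data.
#[derive(Default)]
struct RunningStats {
    count: u64,
    mean: f64,
    m2: f64, // sum of squared deviations from the current mean
}

impl RunningStats {
    fn update(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean);
    }

    fn sample_variance(&self) -> f64 {
        if self.count < 2 { 0.0 } else { self.m2 / (self.count - 1) as f64 }
    }
}

fn main() {
    let mut stats = RunningStats::default();
    for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0].iter().copied() {
        stats.update(x); // each arrival touches only the running summary
    }
    println!("mean = {}, variance = {}", stats.mean, stats.sample_variance());
}
```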
hansvm, over 4 years ago
> It's missing some important feature, like persistence?

For the use cases I'm envisioning, this strikes me as a nice-to-have, and even then only if the persistence API were sufficiently easy to use (or at least to avoid).

> It's had very little advertising?

I hadn't heard much about it till now.

> Rust is intimidating?

At work I need a killer reason to inflict _any_ language on everyone else. We have a lot of shallow computation graphs (really the same few graphs on different datasets) and a few deep graphs which need incremental updates. The cost of an ad-hoc solution is less than the perceived cost (maybe the real cost) of adopting an additional language.

> Other

Broad classes of algorithms will basically expand to being full re-computations with this framework (based on a quick read of the whitepaper), and adopting a tool for efficient incremental updates is less enticing if I'm going to have to manually fiddle with a bunch of update algorithms anyway. E.g., kernel density estimation needs to be designed from the ground up to support incremental updates; a naive translation of those algorithms to dataflow would efficiently update some sub-components like covariance matrices, but you'd still wind up doing O(full_rebuild) work.
gen220, over 4 years ago
We have something very similar to differential dataflow implemented at my current place of work, with our own home-brewed libraries and patterns that leverage the relatively unique way we store our data (most similar to TimescaleDB).

Like map-reduce, most people do not understand how it works and why it is a useful paradigm.

Unlike map-reduce, there is not an entire sub-industry of companies offering it as a service, and engineers who have used it for years without contemplating its alternatives. In the absence of this background noise, people assume DD is niche, and even "wrong" or harmful.

We have new people who come in from time to time, with experience working at a giant MR shop, who spend the first few months wondering aloud why we don't "just use <MR framework>". They usually come around (if they care to understand how this new system works) or give up (usually because they never understood the MR trade-offs in the first place, but were unwilling to part with its style of thinking/working).

One thing I'll note is that our jargon around it is extremely minimal and literal. The diction employed by DD (and TimescaleDB!) feels very formal in comparison, which can be off-putting to prospective users.

I'm not one to advocate for dumbing down your tone (quite the opposite). However, it's interesting to note that the successful-yet-complicated projects (like MR and Kafka) have an accretion disk around them of dumbed-down explanations on Medium, YouTube and the like, that can lure in people who are curious but less academic.

I don't think you can manufacture these. It's just a matter of time until things like this appear for DD.
HeyImAlex, over 4 years ago
I've been curious about it, but it's difficult to wrap my mind around. I've read a lot of Frank McSherry's blog posts, watched his videos, been through the book, and I guess it just hasn't clicked for me! I also don't have any use cases that make sense as a hobby project, and while I know abstractly that it could be useful at work, I can't evangelize something I don't really understand.

Rust took me around three attempts to get into, and it took a motivated project to really seal the deal, but at some point I understood enough that it just became programming again. I haven't reached that with differential dataflow yet, but I'll keep trying.
virgilp, over 4 years ago
I wanted to pick it up; I feel it's under-appreciated technology that has lots of potential. Reasons why I didn't:

- It's somewhat hard to sell to management. There (was) no company behind it to provide support, and it's not a "successful Apache project" with a large-ish community, either. And for a long while it was generally a passion project more than something Frank McSherry would actively encourage you to use in production.

- As others have said, the "hello world" is somewhat tricky. Not a lot of people know Rust. If you say "let's do this project in Rust", this will likely not go well; if I were able to use it from .NET and the JVM, as a library, it might be an easier sell (I'm personally more invested in .NET now, but earlier in my career it would've been the JVM).

- Last but not least: the "productization" story is a bit tricky; comparing it to Spark does it no service. For Spark, not only do I have managed clusters like EMR, but I have a decent amount of tooling to see how the cluster executes (the Spark web UI). Also I can deploy it on Mesos or YARN, not just standalone (and Mesos/YARN have their own tooling). For differential dataflow, one had none of that (at least last time I checked). Maybe it'd be fairer to compare it to Kafka Streams?

  * Might I add: spark-ec2 was a huge help for me picking up Spark, since before the 1.0 version. You can do tons of work on a single machine, yes... but for this kind of system, the very first question is "how do you distribute that?". And you have the story that "it's possible", but you don't have easy examples of "word count, done on 3 machines, not because it's necessary but because we demonstrate how easy it is to distribute the computing across machines".
  * Compared to Kafka Streams: the thing about Kafka Streams is that you know what to use it for (Kafka!) and one immediately groks how one uses it in production (all state management is delegated to Kafka; this is truly just a library that helps you work better with Kafka). With differential dataflow, it's much less clear. You could use it with Kafka, but also with Twitter directly, or with something else. And what happens if it crashes? How do you recover from that? What are the data-loss risks? Does it give you any guarantees, or do you have to manage that?
scott_meyer, over 4 years ago
I am a huge fan of Frank McSherry's work and don't necessarily agree with the premise that DD is somehow failing. However...

Batch data processing is very well understood, cheap, and getting cheaper every year. So, if you can afford to boil the ocean every night, DD is a tough sell.

The addressable market - customers with problems which can only be solved with DD (instantaneous, exactly correct answers) - is probably small right now.
mistersys, over 4 years ago
I think the killer app for differential dataflow would be an easy-to-set-up realtime database like Firebase, but with much richer real-time queries and materialized views.

Materialize (built on differential dataflow) is cool, but doesn't have the complete package of a persisted database.
asavinov, over 4 years ago
Having the possibility to *update* the (query) output with new input data, rather than processing the whole input again even if the changes are very small, is indeed a very useful feature. Assume that you have one huge input table and you computed a result consisting of a few rows. Now you add one record to the input. A traditional data processing system will again process all the input records, while a differential system will *update* the existing output result.

There are the following difficulties in implementing such systems:

- (Small) changes in the input have to be incrementally *propagated* to the output as updates rather than new results. This changes the paradigm of data processing, because now any new operator has to be "update-aware".

- Only simple operators can be easily implemented as "update-aware". For more complex operators, like aggregation or rolling aggregations, it is frequently not clear how it can be done conceptually (and efficiently).

- Differential updates have to be propagated through a graph of operations (a topology), which makes the task more difficult.

- Currently popular data processing approaches (SQL or map-reduce) were not designed for such a scenario, so some adaptation might be needed.

Another system where such an approach was implemented, called incremental evaluation, is Lambdo:

https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last!

Yet this Python library relies on a different, novel data processing paradigm where operations are applied to columns. Mathematically, it uses two types of operations: set operations and function operations, as opposed to traditional approaches based on only set operations.

A new implementation is here:

https://github.com/asavinov/prosto - Functions matter! No join-groupby, no map-reduce.

Yet currently incremental evaluation is implemented only for simple operations (calculated columns).
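A tiny, library-agnostic sketch of what "update-aware" means for the simplest case, a grouped count (the operator and types here are invented for illustration): input arrives as signed deltas, and the operator patches its state and emits only the keys whose counts changed.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Minimal sketch of an "update-aware" grouped count: it consumes signed
/// deltas (key, +n / -n) and patches its state, rather than recounting input.
struct IncrementalCount<K> {
    counts: HashMap<K, i64>,
}

impl<K: Hash + Eq + Clone> IncrementalCount<K> {
    fn new() -> Self {
        Self { counts: HashMap::new() }
    }

    /// Apply a batch of input deltas; return output deltas as (key, old, new).
    fn apply(&mut self, deltas: &[(K, i64)]) -> Vec<(K, i64, i64)> {
        let mut changes = Vec::new();
        for (key, diff) in deltas {
            let entry = self.counts.entry(key.clone()).or_insert(0);
            let old = *entry;
            *entry += diff;
            let new = *entry;
            if new == 0 {
                self.counts.remove(key); // drop keys whose count returns to zero
            }
            if old != new {
                changes.push((key.clone(), old, new));
            }
        }
        changes
    }
}

fn main() {
    let mut op = IncrementalCount::new();
    // Initial load: three "page_view" records, one "click".
    println!("{:?}", op.apply(&[("page_view", 3), ("click", 1)]));
    // Later: one page_view is retracted, two clicks arrive.
    println!("{:?}", op.apply(&[("page_view", -1), ("click", 2)]));
}
```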
andi999, over 4 years ago
I am not working in this domain, but why not put some (a lot of) numbers on the claim that it is dramatically faster than Spark etc.? Maybe show how a 10-hour Spark problem can be reduced to minutes.
james_woods, over 4 years ago
Where and how in dataflow is late data handled? How can I configure the ways refinements relate? These questions are the standard "What / Where / When / How" I want to answer and put into code when dealing with streaming data. I was not able to find this in the documentation, but I only spent a few minutes scanning it.

https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/

https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/

Also, "Materialize" seems not to support needed features like tumbling windows (yet) when dealing with streaming data in SQL: https://arxiv.org/abs/1905.12133

Additionally, "Materialize" states in their docs: *State is all in totally volatile memory; if materialized dies, so too does all of the data.* This is not true, for example, for Apache Flink, which stores its state in systems like RocksDB.

Having side inputs or seeds is pretty neat - imagine you have two tables of several TiB or larger. This is also something that "Materialize" currently lacks: *Streaming sources must receive all of their data from the stream itself; there is no way to "seed" a streaming source with static data.*
_glsb, over 4 years ago
This is interesting! For me - we just hadn't heard about differential dataflow. But we will probably use it in an upcoming project, because I was looking for a solution like that.
jkh1, over 4 years ago
> whether there are also potential users who would have been perfectly happy with javascript/python/R bindings and a good tutorial

If you think people in some communities would benefit, then you should be proactive in advertising there, and in particular in providing bindings for their favourite languages. This would enhance discoverability. In my corner of the tech and science world, people mostly use Python and/or R, but few know about Rust and fewer have knowingly used it.
andyferris, over 4 years ago
For me, it's the fact that we aren't (currently, at least) using Rust. I would possibly, maybe, consider porting it to another language, but haven't had the time...

In general I wonder how many people sit in the intersection of those who are free and willing to base their system on Rust, have dataflow problems to solve, and understand the advantages differential dataflow brings to the table?
krosaen, over 4 years ago
It seems Reflow falls in this category:

https://github.com/grailbio/reflow

> Reflow thus allows scientists and engineers to write straightforward programs and then have them transparently executed in a cloud environment. Programs are automatically parallelized and distributed across multiple machines, and redundant computations (even across runs and users) are eliminated by its memoization cache. Reflow evaluates its programs incrementally: whenever the input data or program changes, only those outputs that depend on the changed data or code are recomputed.
justjonathan, over 4 years ago
> People don't automatically adopt new better things. I don't know why though.

I consulted for Motorola many years ago. I remember one of the senior guys explaining their product view to me: new things needed to be 10x better than the existing option for MOT to be excited or want to invest in a new product; otherwise the switching cost / effort made it too risky that people wouldn't bother to adopt the new thing.
zepto, over 4 years ago
This looks comparable to Incremental in OCaml:

https://opensource.janestreet.com/incremental/

Jane Street uses Incremental quite heavily in their trading platform.

My guess is that not a lot of people are using Rust to build the kinds of platform where this kind of library would see adoption, yet.
legerdemain, over 4 years ago
Palantir has an incremental computation framework incorporated into its data processing platform. [1]

[1] Search for "Incremental Computability of a Dataset Transformation": https://patents.justia.com/patent/20180196862
modeless, over 4 years ago
Could this be used to build a compiler? That's what I really want: a compiler that updates the binary as I type.
zozbot234, over 4 years ago
TensorFlow and Theano are quite popular, and they're all about expressing differentiable computations in a "dataflow"-based framework. It might be a simple case of needing to write some support code to make OP's desired use cases more straightforward when using these frameworks.
BrianOnHN, over 4 years ago
Why is there so little mention (one comment) of online algorithms?

I was surprised by how little attention online algorithms received when I first had to implement one.

My conclusion is that processing power currently overcomes the lack of definition or understanding people have about what they're building.
ilaksh, over 4 years ago
People should stop assuming that merit and popularity are the same thing.

Anyway, now that I know something with that name exists, maybe someday I can learn how it works. Or I will have a project where it is important.
quentusrex, over 4 years ago
I've been using DD in production for just over a year now, for low-latency (sub-second from event IRL to pipeline CDC output) processing in a geo-distributed environment (hundreds of locations globally coordinating), some days at the TB-per-day level of event ingest.

DD for me was one of the final attempts to find something, anything, that could handle the requirements I was working with, because Spark, Flink, and others just couldn't reasonably get close to what I was looking for. The closest second place was Apache Flink.

Over the last year I've read through the DD and TD codebases about 5-7 times fully. Even with that, I'm often in a position where I go back to my own applications to see how I had already solved a type of problem. I liken the project to taking someone used to NASCAR and dropping them into a Formula One vehicle. You've seen it work so much faster, and the tech and capabilities are clearly designed for so much more than you can make it do right now.

A few learning examples that I consider funny:

1. I had a graph on the order of about 1.2 trillion edges with about 90 million nodes. I was using serde-derived structs for the edge and node structs (not simplified numerical types), which means I have to implement (or derive) a bunch of traits myself. I spent way more time than I'd like to admit trying to get .reduce() to work to remove 'surplus' edges that had already been processed from the graph, to shrink the working dataset. Finally, in frustration and while reading through the DD codebase again, I 'rediscovered' .consolidate(), which 'just worked', taking the 1.2 trillion edges down to 300 million edges. For instance, some of the edge values I need to work with have histograms for the distributions, and some of the scoring of those histograms is custom. Not usually an issue, except that having to figure out how to implement a bunch of the traits has been a significant hurdle.

2. I get to constantly dance between DD's runtime and trying to ergonomically connect the application to the tonic gRPC and tokio interfaces. Luckily I've found a nice pattern where I create my inter-thread communication constructs, then start up two Rust threads, and start the tokio-based interfaces in one and the DD runtime and workers in the other. On bigger servers (packet.net has some great gen3 instances) I usually pin tokio to 2-8 cores and leave the rest of the cores to DD.

3. Almost every new app I start, I run into the gotcha where I want to have a worker that runs only once 'globally', and it's usually the thread that I'd want to use to coordinate data ingestion. Super simple to just have a guard for if worker.index() == 0, but when deep in thought about an upcoming pipeline, it's often forgotten.

4. For diagnostics, there is https://github.com/TimelyDataflow/diagnostics which has provided much-needed insights when things have gotten complex. Usually it's been 'just enough' to point in the right direction, but only once was the output able to point exactly to the issue I was running into.

5. I have really high hopes for materialize.io. That's really the type of system I'd want to use in 80% of the cases I'm using DD for right now. I've been following them for about a year now, and the progress is incredible, but my use cases seem more likely to be supported in the 0.8->1.3 roadmap range.

6. I've wanted to have a way to express 'use no more than 250GB of RAM' and have some way to get compile-time feedback that a fixed dataset won't be able to process the pipeline within that much in resources. It'd be far better if the system could adjust its internal runtime approach in order to stay within the limits.
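For point 3, a minimal sketch of that guard, assuming the usual timely/differential entry points (execute_from_args, worker.index(), InputSession): every worker builds the dataflow, but only worker 0 performs the one-off ingestion.

```rust
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::Count;

// Sketch of the "run once globally" guard: every worker builds the dataflow,
// but only worker 0 feeds data in, so ingestion happens exactly once.
fn main() {
    timely::execute_from_args(std::env::args(), move |worker| {
        let mut input = InputSession::<usize, String, isize>::new();

        worker.dataflow(|scope| {
            input
                .to_collection(scope)
                .count()
                .inspect(|change| println!("count change: {:?}", change));
        });

        // Easy to forget when deep in pipeline design: without this guard,
        // every worker would insert its own copy of the records.
        if worker.index() == 0 {
            for word in ["dataflow", "differential", "dataflow"].iter() {
                input.insert(word.to_string());
            }
        }
        // All workers still advance and flush their input handles.
        input.advance_to(1);
        input.flush();
    })
    .expect("timely computation failed");
}
```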
teekert, over 4 years ago
I use Snakemake, and this sounds like it does the same? I think Snakemake is pretty popular, at least among bioinformaticians.
acjohnson55, over 4 years ago
Perhaps it will be. I'm super excited about Materialize! If it really takes off, it will surely inspire other projects.
KptMarchewa, over 4 years ago
At a quick glance, it seems you should compare it more to Flink than to Spark.
king_magic, over 4 years ago
Scanned through the GitHub page. I don't really see immediate value in using it vs. other solutions.