Turning the database inside-out with Apache Samza

232 points by martinkl over 10 years ago

14 comments

slashdev over 10 years ago

Immutability is hardly a cure-all, see the discussion here for why RethinkDB moved away from it: http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/

The reality is shared, mutable state is the most efficient way of working with memory-sized data. People can rant and rave all they want about the benefits of immutability vs mutability, but at the end of the day, if performance is important to you, you'd be best to ignore them.

Actually, to be more honest, reality is more complicated still. MVCC that many databases use to get ACID semantics over a shared mutable dataset is really a combination of mutable and immutable.

pavlov over 10 years ago

... most self-respecting developers have got rid of mutable global variables in their code long ago.

I'm not convinced that's the case. Almost everyone has merely hidden their mutable globals under layers of abstractions. Things like "singletons", "factories", "controllers", "service objects", "dependency injection" are the vernacular of the masked-globals game.
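
To make the "masked globals" point concrete, here is a minimal sketch of a singleton behaving exactly like a mutable global. The `Config` class and the two helper functions are hypothetical names invented for illustration, not taken from any real codebase.

```python
class Config:
    """A 'singleton': one shared instance reachable from anywhere."""
    _instance = None

    def __init__(self) -> None:
        self.settings = {}

    @classmethod
    def instance(cls) -> "Config":
        # Lazily create the single shared instance.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


def enable_debug() -> None:
    # Mutates state that every other caller of Config.instance() will see.
    Config.instance().settings["debug"] = True


def is_debug() -> bool:
    return Config.instance().settings.get("debug", False)


before = is_debug()   # nothing has touched the shared state yet
enable_debug()        # an unrelated function changes it
after = is_debug()    # the change is visible here without being passed in
```

Neither function takes the configuration as an argument, yet one changes what the other returns, which is the defining property of a mutable global.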

bmh100 over 10 years ago

As one who works with analytics databases and ETL (extract-transform-load) processes a great deal, immutability of data stores is an incredibly valuable property. Maybe append-only does not make sense in operational databases all the time, but for non-real-time analytics, it makes a huge amount of sense. In my case, operational data is queried, optimized for storage space and quick loading, and cached to disk. Because it is an analytics database used for longer-term analysis and planning, daily queries of operational data are sufficient in many cases. Operational workload is not even a consideration. The ETL process also allows for "updating" records in the "T" (transform) part. Updates to operational data are not even necessary, and often impossible, so correcting and enhancing the data for decision making is a huge win for clients. Issues similar to "compaction time" can still occur, but an ETL approach allows for many clean ways of controlling the process and avoiding those failure scenarios.

boredandroid over 10 years ago

Anyone in the Bay Area interested in learning more about Apache Samza should attend the meetup tonight in Mountain View: http://www.meetup.com/Bay-Area-Samza-Meetup/events/220354853/

shanemhansen over 10 years ago

I'm not sold on Samza, but I can tell you that creating isolated services that create their datastore from a stream of events is a really useful pattern in some use cases (ad-tech).

I've made use of NSQ to stream user update events (products viewed, orders placed) to servers sitting at the network edge which cache the info in leveldb. Our request latency was something like 10 microseconds over Go's json/rpc. We weren't even able to come close to that in the other nosql database servers we tried, even with aggressive caching turned on.
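
A rough sketch of that pattern, with plain Python dicts standing in for leveldb and a list standing in for the NSQ stream. The event shape (`type`, `user`, `product`) and the `EdgeCache` class are assumptions made up for illustration, not the setup described above.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeCache:
    """Local store rebuilt purely from a stream of user events (hypothetical)."""
    products_viewed: dict = field(default_factory=dict)
    orders_placed: dict = field(default_factory=dict)

    def apply(self, event: dict) -> None:
        # Fold each event into local state; the stream is the source of truth.
        user = event["user"]
        if event["type"] == "product_viewed":
            self.products_viewed.setdefault(user, []).append(event["product"])
        elif event["type"] == "order_placed":
            self.orders_placed[user] = self.orders_placed.get(user, 0) + 1

    def lookup(self, user: str) -> dict:
        # Reads are served from process-local state, with no database round trip.
        return {
            "viewed": list(self.products_viewed.get(user, [])),
            "orders": self.orders_placed.get(user, 0),
        }


def build_cache(events) -> EdgeCache:
    # Replaying the stream from the start reconstructs the whole datastore.
    cache = EdgeCache()
    for event in events:
        cache.apply(event)
    return cache


events = [
    {"type": "product_viewed", "user": "u1", "product": "p9"},
    {"type": "order_placed", "user": "u1"},
    {"type": "product_viewed", "user": "u2", "product": "p3"},
]
cache = build_cache(events)
```

Because the cache is only ever a function of the events it has consumed, a new edge server can be brought up by replaying the stream rather than copying a database.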

sivers over 10 years ago

Similar interesting talk by Rich Hickey: http://www.infoq.com/presentations/Value-Values

vkjv over 10 years ago

You can do similar "magic" cache invalidation with Elasticsearch and the percolate feature. Each time you do a query and cache some transformation of the result, put that query in a percolate index. Then when you change a document, run the document against the percolate index and, voila, you get the queries that would have returned it and can then invalidate your cache.

This method of cache invalidation fails in a very key place though (just like in the article). What happens if you change a very core thing that invalidates a large percentage of the cache?
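
The idea can be sketched without Elasticsearch at all. The following is an in-memory analogue, not the Elasticsearch API: queries are plain Python predicates, and `PercolatingCache` is a hypothetical name. One detail is an addition here: the old version of the document is matched as well as the new one, so that queries the document used to satisfy are also invalidated.

```python
from typing import Callable, Dict, Set

Document = dict
Query = Callable[[Document], bool]


class PercolatingCache:
    def __init__(self) -> None:
        self.queries: Dict[str, Query] = {}   # registered queries, by cache key
        self.cache: Dict[str, list] = {}      # cached results, by cache key

    def run_and_cache(self, key: str, query: Query, docs: list) -> list:
        # Cache the result and remember the query that produced it.
        self.queries[key] = query
        self.cache[key] = [d for d in docs if query(d)]
        return self.cache[key]

    def on_document_change(self, old: Document, new: Document) -> Set[str]:
        # "Percolate": find every stored query that matches either version of
        # the document, since its cached result may now be wrong.
        stale = {k for k, q in self.queries.items() if q(old) or q(new)}
        for key in stale:
            self.cache.pop(key, None)
        return stale


docs = [{"id": 1, "color": "red"}, {"id": 2, "color": "blue"}]
pc = PercolatingCache()
pc.run_and_cache("reds", lambda d: d["color"] == "red", docs)
pc.run_and_cache("blues", lambda d: d["color"] == "blue", docs)
stale = pc.on_document_change({"id": 2, "color": "blue"}, {"id": 2, "color": "green"})
```

The sketch also shows the failure mode: a change that matches most registered queries evicts most of the cache in one step.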

bonobo3000 over 10 years ago

This is a cool idea - the holy grail scenario I'm envisioning is storing all data in the log, i.e.

1. the transaction log is a central repository for all data
2. much more detailed data is stored, enough that analytics can run off this same source of data

The amount of data generated increases in proportion to the number of updates on a row/piece of data, whereas with a mutable solution it is constant w.r.t. the number of updates on the same data. That is a pretty big scaling difference.

However, storing that much data translates to much higher costs for HDDs/servers, or possibly lower write performance if the log is stored on something like HDFS.

There would also be performance costs for building and updating a materialized view. Imagine a scenario like this:

Events -> A B C D E F G H I J K
Materialized view M has been computed up to item J (but not K yet)
Read/Query M

Now either writing K incurs the cost of waiting for all dependent views to materialize, or the read on M incurs the cost of updating M.

Some fusion of this would be pretty interesting though. For example, what if we just query M without applying any updates if there have been <X updates? That translates to similar guarantees as an eventually consistent DB - the data could be stale. At least it gives us more control over this tradeoff.
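
The "<X updates" compromise could look roughly like this. It is a toy sketch under stated assumptions: the view is a running sum over an integer log, and `LazyView` and `max_lag` (playing the role of X) are hypothetical names.

```python
class LazyView:
    """A running-sum view over an append-only log, refreshed lazily."""

    def __init__(self, log: list, max_lag: int) -> None:
        self.log = log            # shared append-only event log
        self.max_lag = max_lag    # X: tolerated number of unapplied events
        self.applied = 0          # how many log entries the view has consumed
        self.total = 0            # the materialized value

    def lag(self) -> int:
        return len(self.log) - self.applied

    def refresh(self) -> None:
        # Apply only the events the view has not seen yet.
        for value in self.log[self.applied:]:
            self.total += value
        self.applied = len(self.log)

    def read(self) -> int:
        # Serve a possibly stale answer while the lag is below the threshold;
        # otherwise pay the catch-up cost on the read path.
        if self.lag() >= self.max_lag:
            self.refresh()
        return self.total


log = [1, 2, 3]
view = LazyView(log, max_lag=2)
view.refresh()          # view is current with the log
log.append(10)          # one unapplied event: below the threshold
stale = view.read()     # answered from the old view, without catching up
log.append(20)          # two unapplied events: threshold reached
fresh = view.read()     # this read triggers the catch-up first
```

Writes never wait on the view here; the cost moves entirely to reads, and `max_lag` is the knob that trades staleness against read latency.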

swah over 10 years ago

I really enjoyed reading about Storm too: http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html

This kind of "competition" leads to analysis paralysis though. It's much better when there is a single winner...

bambax over 10 years ago

A more promising model, used in some systems, is to think of a database as an always-growing collection of immutable facts.

That would already be huge progress over how databases are currently used; if records were in fact immutable, many problems would be instantly solved.
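
For anyone who wants to see what "an always-growing collection of immutable facts" can mean in practice, here is a minimal sketch. `Fact`, `FactLog`, and the entity/attribute/value layout are assumptions chosen for illustration and are not tied to any particular database.

```python
from typing import NamedTuple, Optional


class Fact(NamedTuple):
    seq: int        # position in the log, used as logical time
    entity: str
    attribute: str
    value: object


class FactLog:
    def __init__(self) -> None:
        self._facts: list = []

    def assert_fact(self, entity: str, attribute: str, value: object) -> Fact:
        # Facts are only ever appended; nothing is updated in place.
        fact = Fact(len(self._facts), entity, attribute, value)
        self._facts.append(fact)
        return fact

    def value_as_of(self, entity: str, attribute: str, seq: Optional[int] = None):
        # The "current" value is just the latest matching fact up to `seq`.
        result = None
        for fact in self._facts:
            if seq is not None and fact.seq > seq:
                break
            if fact.entity == entity and fact.attribute == attribute:
                result = fact.value
        return result


log = FactLog()
log.assert_fact("user:1", "email", "a@example.com")
log.assert_fact("user:1", "email", "b@example.com")
```

An "update" is a second fact rather than an overwrite, so `log.value_as_of("user:1", "email", seq=0)` can still answer what the value used to be.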

steve-rodrigue over 10 years ago

Does anyone know which app was used to create the "handwritten" images? I draw very badly, so I'm looking for such an app to explain data flows on a corporate blog/wiki.

hyc_symas over 10 years ago

Streams - another reinvention of LDAP Persistent Search.

Yes, there really are protocols that handle single request/multiple response interactions, and they've been around for decades. Unlike crap built on HTTP, which was never intended for uses like this, these protocols work well with multiple concurrent requests in flight simultaneously, etc.

hyperliner over 10 years ago

Conceptually, one of the challenges of streams as first class citizens is that humans don't do well with them. For the purposes of analysis, humans need a "snapshot" or fix on the data. This way they can derive insights from the data and act on human things. The reality is that, for many real-world scenarios, a real-time view of the data is not just a luxury, it's actually a drawback, because data changes are noisy. Many human problems deal with abstract representations of the actual data, and so imprecision is part of the problem.

I really like the talk from the point of view of simplifying the system-wide problems caused by a gigantic mutable state. But I feel that at the border of system to humans there will be other issues to discuss.

fiatjaf over 10 years ago

This is CouchDB, right?