Kafka Is Not a Database

313 points by andrioni over 4 years ago

30 comments

Alternatively, Jay Kreps offers a much more thorough and nuanced discussion [1] that is probably the best write-up on this topic.

"So is it crazy to do this? The answer is no, there’s nothing crazy about storing data in Kafka: it works well for this because it was designed to do it. Data in Kafka is persisted to disk, checksummed, and replicated for fault tolerance. Accumulating more stored data doesn’t make it slower. There are Kafka clusters running in production with over a petabyte of stored data."

[1] https://www.confluent.io/blog/okay-store-data-apache-kafka/


j-pb over 4 years ago

Tbh, it's a weird blog post coming from the Materialize folks, considering they know better.

The "event sourced" arch they sketched is missing pieces. Normally you'd have single-writer instances that are locked to the corresponding Kafka partition, which ensure strong transactional guarantees, IF you need them.

Throwing shade for marketing's sake is something that they should be above.

I mean c'mon, I'd argue that Postgres enhanced with Materialize isn't a database anymore either, but in a good sense!

It's building material. A hybrid between MQ, DB, backend logic & frontend logic.

The reduction in application logic and the increase in reliability you can get from reactive systems is insane.

SQL is declarative, reactive Materialize streams are declarative on a whole new level.

Once that tech makes it into other parts of computing like the frontend, development will be so much better: less code, fewer bugs, a lot more fun.

Imagine that your React component could simply declare all the data it needs from a db, and the system will figure out all the caching and rerendering.

So yeah, they have awesome tech with many advantages, so I don't get why they bad-mouth other architectures.


zaphar over 4 years ago

If it stores data it's a database. Filesystems are databases, MongoDB is a database. LevelDB is a database. Postgres and MySQL are databases. Kafka is a database. They are all very different in features and functionality though.

What the authors mean is that Kafka is not a traditional database and doesn't solve the same problems that traditional databases solve. Which is a useful distinction to make but is not the distinction they make.

The reality is that "database" is now a very general term, and for many use cases you can choose special-purpose databases for what you need.


Spivak over 4 years ago

I feel like the inventory thing is a bit of a straw man because the situation is set out in such a way that you need transactions for it to work. If you find yourself wishing you had a global write-lock on a topic, then of course it won't work. Modeling your data for Kafka is work just the same as it is for MySQL. Of course it might not be the best tool for the job but you should at least give it a fair shake.

You should be able to post "buy" messages to a topic without fear that it messes up your data integrity. Who cares if two people are fighting over the last item? You have a durable log. Post both "buys" and wait for the "confirm" message from a consumer that's reading the log at that point in time, validates, and confirms or rejects the buys. At the point that the buy reaches a consumer there is enough information to know for sure whether it's valid or not. Both of the buy events happened and should be recorded whether they can be fulfilled or not.
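A minimal sketch of the flow described above, assuming a plain Python list stands in for the Kafka topic and a single consumer reads it in offset order. The names `Buy` and `validate` are hypothetical and not part of any Kafka API; the point is only that both buys are recorded and the ordered reader decides which one is confirmed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buy:
    buyer: str
    item: str


def validate(log, stock):
    """Read buys in log order; confirm while stock lasts, reject afterwards."""
    remaining = dict(stock)  # copy so the caller's stock is not mutated
    results = []
    for offset, buy in enumerate(log):
        if remaining.get(buy.item, 0) > 0:
            remaining[buy.item] -= 1
            results.append((offset, buy.buyer, "confirm"))
        else:
            results.append((offset, buy.buyer, "reject"))
    return results


# Two buyers race for the last widget; both events stay in the log.
log = [Buy("alice", "widget"), Buy("bob", "widget")]
results = validate(log, {"widget": 1})
```

Because the decision is a pure function of the log order, replaying the same log always yields the same confirmations.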


je42 over 4 years ago

Ok. I admit using Kafka as a DB is not straightforward, but just stating it doesn't provide ACID functionality is not enough.

The example they give is very simplistic. With the correct design of Kafka topics and events the problem in the example can be fixed.

And according to Oracle (https://www.oracle.com/database/what-is-database/):

> A database is an organized collection of structured information, or data, typically stored electronically in a computer system.

So Kafka clearly fits that definition.


tacitusarc over 4 years ago

I think because software engineers tend to excel at pattern recognition, oftentimes solutions to different problems appear so similar that it seems like with a small amount of abstraction, they can be reused. But it's a trap!

Everything abstracted to the highest level is the same, but problems aren't solved at the highest level.

The devil, as they say, is in the details.


UK-Al05 over 4 years ago

One way around this is to make sure your Kafka command streams are processed in order and serially, partitioned by the id where you want the concurrency control.

Normally you only want concurrency control within certain boundaries.

By figuring out the minimum set of transaction and concurrency boundaries you can eke out quite a bit of performance.
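As a rough illustration of partitioning by id, the sketch below routes every command for the same key to the same partition and then applies each partition's commands one at a time. It is a toy model under stated assumptions: `partition_for`, `route`, and `apply_serially` are hypothetical helpers, and CRC32 is only a stand-in for whatever hash a real Kafka partitioner uses.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash, so the same id always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(commands, num_partitions):
    """Append each (key, delta) command to its key's partition, keeping order."""
    partitions = defaultdict(list)
    for key, delta in commands:
        partitions[partition_for(key, num_partitions)].append((key, delta))
    return partitions


def apply_serially(partition_commands):
    """Apply one partition's commands in order; no locking is needed here."""
    balances = {}
    for key, delta in partition_commands:
        balances[key] = balances.get(key, 0) + delta
    return balances


commands = [("acct-1", 10), ("acct-2", 5), ("acct-1", -3), ("acct-2", 1)]
partitions = route(commands, num_partitions=4)

# Each key lives in exactly one partition, so per-partition results never clash.
totals = {}
for partition_commands in partitions.values():
    totals.update(apply_serially(partition_commands))
```

Partitions can be processed in parallel with each other; the serial guarantee only has to hold inside a partition, which is the "certain boundaries" idea from the comment.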


shay_ker over 4 years ago

This is maybe a silly question, but what's the difference between the timely dataflow that Materialize uses and Spark's execution engine? From my understanding they're doing very similar things: break down a sequence of functions on a stream of data, parallelize them on several machines, and then gather the results.

I understand that the feature set of timely dataflow is more flexible than Spark's. I just don't understand why (I couldn't figure it out from the paper; academic papers really go over my head).


fredliu over 4 years ago

Kafka is essentially a commit log, which is at the core of any traditional database engine. Streaming is just turning the guts of a DB inside out (mostly for scalability reasons), while a DB is a wrapped-up commit log that provides higher-level functionality (ACID, transactions, etc.). They're two sides of the same coin, the yin and yang of the same thing... But on the practical side of things, yes, if what you need is indeed what's described in this article, your life would be easier with a traditional DB.

jdmichal over 4 years ago

So, the problem really being addressed but not named is that eventing systems give eventual consistency. But sometimes that's not good enough. And it's OK to admit that and bring in another technology when you need a stronger guarantee than that.

The example I was taught with was a booking system, where the inventory management system-of-record was separate from the search system. Search does not need 100% up-to-date inventory. A delay between the last item being booked and it being removed from the search results is acceptable. In fact, it has to be acceptable, because it can happen anyway. If someone books the last item after another hit the search button... There's nothing the system can do about that.

When actually committing a booking, however, then that must be atomically done within the inventory management system.

So, to bring it home, it's OK for the search system to be eventually consistent against bookings, and read bookings off of an event stream to update its internal tracking. However, the bookings themselves cannot be eventually consistent without risking a double-booking.
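The split described above can be sketched roughly as follows, assuming an in-process lock stands in for the atomicity a real system of record would provide and a list stands in for the event stream. `Inventory` and `SearchIndex` are hypothetical names used only for illustration.

```python
import threading


class Inventory:
    """System of record: the check and the decrement happen under one lock."""

    def __init__(self, stock):
        self._stock = dict(stock)
        self._lock = threading.Lock()
        self.events = []  # stand-in for the event stream that search reads

    def book(self, item, customer):
        with self._lock:
            if self._stock.get(item, 0) <= 0:
                return False  # rejected atomically, so no double-booking
            self._stock[item] -= 1
            self.events.append((item, customer))
            return True


class SearchIndex:
    """Eventually consistent view: it only changes when it consumes events."""

    def __init__(self, stock):
        self.available = dict(stock)
        self._offset = 0

    def catch_up(self, events):
        for item, _customer in events[self._offset:]:
            self.available[item] -= 1
        self._offset = len(events)


inventory = Inventory({"room-101": 1})
search = SearchIndex({"room-101": 1})

first = inventory.book("room-101", "alice")
second = inventory.book("room-101", "bob")

stale = search.available["room-101"]  # search has not seen the booking yet
search.catch_up(inventory.events)
fresh = search.available["room-101"]
```

The search view is briefly wrong, which is tolerable; the booking path is never wrong, because the second booking is refused inside the system of record.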

jkarneges over 4 years ago

Another potential misuse of Kafka I've been wondering about is how a single Kafka instance/cluster is often shared by multiple microservices.

On one hand the ability to connect multiple microservices to a central message broker is convenient, but on the other hand this goes against the microservice philosophy of not sharing subcomponents (databases, etc). I wonder where the lines should be drawn.


Cojen over 4 years ago

Jim Gray disagrees: https://arxiv.org/ftp/cs/papers/0701/0701158.pdf


soumyadeb over 4 years ago

The architecture of dumping events into Kafka and creating materialized views is a perfect choice for many use cases - e.g. collecting clickstream data and building analytical reports.

If ACID is a prerequisite, then a lot of things won't qualify as databases: none of Mongo, Cassandra, ElasticSearch, etc. Not even many data warehouses.

Kalium over 4 years ago

As recently as last year, I worked for a company where the Chief Architect, in his infinite wisdom, had decided that a database was a silly legacy thing. The future looked like Kafka streams, with each service being a function against Kafka streams, and data retention set to infinite.

Predictably, this setup ran into an interesting assortment of issues. There were no real transactions, no ensured consistency, and no referential integrity. There was also no authentication or authorization, because a default-configured deployment of Kafka from Confluent happily neglects such trivial details.

To say this was a vast mess would be to put it lightly. It was a nightmare to code against once you left the fantasy world of functional programming nirvana and encountered real requirements. It meant pushing a whole series of concerns that isolation addresses into application code... or not addressing them at all. Teams routinely relied on one another's internal Kafka streams. It was a GDPR nightmare.

Kafka Connect was deployed to bridge between Kafka and some real databases. This was its own mess.

Kafka, I have learned, is a very powerful tool. And like all shiny new tools, deeply prone to misuse.


hodgesrm over 4 years ago

Is this really a thing? Do people really try to use Kafka as the system of record for financial transactions or similar data?


jgraettinger1 over 4 years ago

This post doesn't mention the _actual_ answer, which is to:

1) Write an event recording a _desire_ to checkout.
2) Build a view of checkout decisions, which compares requests against inventory levels and produces checkout _results_. This is a stateful stream/stream join.
3) Read out the checkout decision to respond to the user, or send them an email, or whatever.

CDC is great and all, too, but there are architectures where ^ makes more sense than sticking a database in front.

Admittedly working up highly available, stateful stream-stream joins which aren't challenging to operate in production is... hard, but getting better.
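The three steps above can be sketched as a tiny in-memory join, assuming two already-ordered streams (inventory changes and checkout requests) that are merged by timestamp. The function `decide` and the tuple layouts are hypothetical; a production stream processor would also have to handle partitioning, state recovery, and late events, none of which is modeled here.

```python
import heapq


def decide(inventory_events, checkout_requests):
    """Join two time-ordered streams into a view of checkout decisions.

    inventory_events:  (timestamp, item, delta)
    checkout_requests: (timestamp, request_id, item)
    """
    # Tag each stream so that, at equal timestamps, inventory changes (0)
    # are applied before checkout requests (1).
    tagged_inventory = ((ts, 0, item, delta) for ts, item, delta in inventory_events)
    tagged_requests = ((ts, 1, item, req_id) for ts, req_id, item in checkout_requests)

    levels = {}     # join state: current inventory level per item
    decisions = {}  # the materialized view: request id -> result
    for _ts, kind, item, payload in heapq.merge(tagged_inventory, tagged_requests):
        if kind == 0:
            levels[item] = levels.get(item, 0) + payload
        elif levels.get(item, 0) > 0:
            levels[item] -= 1
            decisions[payload] = "accepted"
        else:
            decisions[payload] = "rejected"
    return decisions


inventory_events = [(1, "widget", 1), (5, "widget", 1)]
checkout_requests = [(2, "req-a", "widget"), (3, "req-b", "widget"), (6, "req-c", "widget")]
decisions = decide(inventory_events, checkout_requests)
```

Step 3 is then just a lookup in `decisions` by request id once the decision for that request has been produced.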


vladsanchez over 4 years ago

I want to upvote this more than once. So many facts condensed into a small essay. Good job!

Money quote: "Event-sourced architectures like these suffer many such isolation anomalies, which constantly gaslight users with “time travel” behavior that we’re all familiar with."


theptip over 4 years ago

This is a bit dumbed down, and ignores the domain terminology required to properly discuss the trade-offs here (which is puzzling given that it links to a post by Aphyr, where you can find incredibly thorough discussions around isolation levels and anomalies).

> The fundamental problem with using Kafka as your primary data store is it provides no isolation.

This is false. I can only assume the author doesn't know about the Kafka transactions feature?

To be specific, Kafka's transaction machinery offers read-committed isolation, and you get read-uncommitted by default if you don't opt in to that transaction machinery (the docs: https://kafka.apache.org/0110/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html). Depending on your workload, read-committed might be sufficient for correctness, in which case you can absolutely use Kafka as your database.

Of course, proving that your application is sound with just read-committed isolation can be challenging, not to mention testing that your application continues to be sound as new features are added.

Because of that, in general I think that the underlying point of this article is probably correct, in that you probably shouldn't use Kafka as your database -- but for certain applications / use cases it's a completely valid system design choice.

More generally this is an area that many applications get wrong by using the wrong isolation levels, because most frameworks encourage incorrect implementations through their unsafe defaults; e.g. see the classic "Feral concurrency control" paper: http://www.bailis.org/papers/feral-sigmod2015.pdf

So I think the general message of "don't use Kafka as your DB unless you know enough about consistency to convince yourself that read-committed isolation is and will always be sufficient for your use case" would be more appropriate (though it's certainly a less snappy title).
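To make the difference between the two isolation levels concrete, here is a toy model, not Kafka's actual implementation. It assumes a log of `(kind, transaction_id, value)` tuples where `kind` is `"data"`, `"commit"`, or `"abort"`; the function `read` and this record layout are hypothetical. In a real consumer the behavior is selected with the `isolation.level` setting (`read_uncommitted` or `read_committed`).

```python
def read(log, isolation_level="read_uncommitted"):
    """Return the values a consumer would see under the given isolation level."""
    if isolation_level == "read_uncommitted":
        # Every data record is visible, whatever happened to its transaction.
        return [value for kind, _txn, value in log if kind == "data"]

    # Outcome of each transaction that has a commit or abort marker in the log.
    outcome = {txn: kind for kind, txn, _value in log if kind in ("commit", "abort")}

    visible = []
    for kind, txn, value in log:
        if kind != "data":
            continue
        if txn not in outcome:
            break  # first record of a still-open transaction: stop reading here
        if outcome[txn] == "commit":
            visible.append(value)  # records of aborted transactions are skipped
    return visible


log = [
    ("data", "t1", "a"),
    ("data", "t2", "b"),
    ("commit", "t1", None),
    ("abort", "t2", None),
    ("data", "t3", "c"),  # t3 is still open
]

uncommitted_view = read(log)
committed_view = read(log, "read_committed")
```

Read-committed hides aborted and in-flight writes, but it says nothing about two transactions that each read a consistent snapshot and then make conflicting writes, which is why the comment stresses checking that this level is enough for the workload.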


lmm over 4 years ago

> The problem we now have is called write skew. Our reads from the inventory view can be out of date by the time the checkout event is processed. If two users try to buy the same item at nearly the same time, they will both succeed, and we won’t have enough inventory for them both.

And you'll have exactly the same problem if you're using a traditional ACID database: the user saw the item as being available, clicked buy, but it was unavailable by the time they went to get it. Using an ACID database doesn't gain you anything; you might as well just use Kafka for everything.


fouc over 4 years ago

Any sufficiently complex software will end up implementing a database.


arthurcolle over 4 years ago

I had an issue with RabbitMQ where I didn't know how my consumer was going to use the data that I was writing to a queue yet (from a producer that was listening on a SocketIO or WebSockets stream), and I was kind of just going to figure it out in an hour or something.

Eventually, my buffer ran out of memory and I couldn't write anything else to it, and it was dropping lots of messages. I was bummed. Is there a way to avoid this in Kafka?


amai over 4 years ago

Elasticsearch is also not a database.


diehunde over 4 years ago

Relevant to the discussion:

Martin Kleppmann | Kafka Summit SF 2018 Keynote (Is Kafka a Database?) [1]

[1] https://www.youtube.com/watch?v=v2RJQELoM6Y

based2 over 4 years ago

https://www.postgresql.org/docs/10/rules-materializedviews.html

EamonnMR over 4 years ago

Kafka is a very nice communication channel. You can dump the results into a database and query it if you need a database.

tutfbhuf over 4 years ago

Well, then you have never heard of ksqlDB. It adds SQL and DB features to Kafka. It is backed by Confluent, the company founded by the people who originally developed Kafka at LinkedIn.

https://ksqldb.io


somurzakov over 4 years ago

How does using Kafka come into play when implementing the Actor model?

Has anyone successfully implemented an actor model framework over Kafka?

Interested in learning from others' experience.

joking over 4 years ago

neither has to be.

detay over 4 years ago

Using any tool for the right problem requires skill.

hasanic over 4 years ago

I mean, duh? Did Apache Kafka ever make the claim that it is a database?

Other things that are not a database: Apache Traffic Server, Apache Mahout, Apache Jakarta, Apache ActiveMQ... hundreds of these exist.
