Ask HN: Has anyone fully embraced an event-driven architecture?

284 points by sideway almost 4 years ago

After reading quite a few books and blog posts on event-driven architectures and comparing the suggested patterns with what I've seen myself in action, I keep wondering:

Is there any company out there that has fully embraced this type of architecture when it comes to microservice communication, handling breaking schema changes or failures in an elegant way, and keeping engineers and other data consumers happy enough?

Every event-driven architectural pattern I've read about can quite easily fall apart and I have yet to find satisfying answers on what to do when things go south. As a trivial example, everybody talks about dead-letter queues but nobody really explains how to handle messages that end up in one.

Is there any non-sales community of professionals discussing this topic?

Any help would be much appreciated.

65 comments

jfoutz almost 4 years ago

You're not going to like the answer, but I think it captures some of what you're getting at.

Windows 95. Old-style GUI programming meant sitting in a loop, waiting for the next event, then handling it. You type a letter, there's a case switch, and the next character is rendered on the screen. Being able to copy a file and type at the same time was a big deal. You'd experience the dead letter queue when you moved a window while the OS was handling a device, and the window would sort of smear across the screen when the repaint events were dropped.

Concurrent programming is hard. State isolation from micro services helps a lot, but eventually you'll need to share state, and people try stuff like `add 1 to x`, but that has bugs, so they say, `if x == 7 add 1 to x`, but that has bugs, so they say, `my vector clock looks like foo. if your vector clock matches, add 1 to x, and give me back your updated vector clock`, but now you've imposed a total order and have given up a lot of performance.

I'm blind to the actual problem you're facing. My default recommendation is to have a monorepo, and break out helpers for expensive tasks, no different than spinning up a thread or process on a big host. Have a plan for building pipelines a->b->c->d. Also have a plan for fan-out: a->b & a->c & a->d.

It has been widely observed there are no silver bullets. But there are regular bullets. Careful and thoughtful use can be a huge win. If you're in that exponential growth phase, it's duct tape and baling wire all the way; get used to everything being broken all the time. If you're not, take your time and plan out a few steps ahead. Operationalize a few micro services. Get comfortable with the coordination and monitoring. Learn to recover gracefully, and hopefully find a way to dodge that problem next time around.

Sorry this is hand wavy. I don't think you're missing anything. It's just hard. If you're stuck because it won't fit on 1X anymore, you've got to find a way to spread out the load.
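
A minimal Python sketch of the conditional-update progression described in the comment above, assuming a single in-process counter. It uses one version number rather than a real vector clock, and the names `VersionedCounter`, `add_if_version`, and `increment` are made up for illustration.

```python
import threading
from typing import Tuple


class VersionedCounter:
    """A shared counter that only accepts updates based on its latest version."""

    def __init__(self, value: int = 0) -> None:
        self._lock = threading.Lock()
        self.value = value
        self.version = 0

    def read(self) -> Tuple[int, int]:
        """Return (value, version) as one consistent snapshot."""
        with self._lock:
            return self.value, self.version

    def add_if_version(self, expected_version: int, delta: int) -> bool:
        """Apply delta only if nobody else has updated since expected_version."""
        with self._lock:
            if self.version != expected_version:
                return False  # stale view: the caller must re-read and retry
            self.value += delta
            self.version += 1
            return True


def increment(counter: VersionedCounter) -> int:
    """Retry loop: re-read and try again until the conditional update lands."""
    while True:
        value, version = counter.read()
        if counter.add_if_version(version, 1):
            return value + 1
```

Every writer is serialized through the version check, which is the loss of performance the comment points at: correctness is bought by forcing a single order on all updates.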

evanrich almost 4 years ago

Like others have said, it is just one tool in the tool box.

We used Kafka for event-driven micro services quite a bit at Uber. I led the team that owned schemas on Kafka there for a while. We just did not accept breaking schema changes within the topic. Same as you would expect from any other public-facing API. We also didn't allow multiplexing the topic with multiple schemas. This wasn't just because it made my life easier. A large portion of the topics we had went on to become analytical tables in Hive. Breaking changes would break those tables. If you absolutely have to break the schema, make a new topic with a new consumer and phase out the old. This puts a lot of onus on the producer, so we tried to make tools to help. We had a central schema registry, with the topics the schemas paired to, that showed producers who their consumers were, so if breaking changes absolutely had to happen, they knew who to talk to. In practice though, we never got much pushback on the no-breaking-changes rule.

DLQ practices were decided by teams based on need; there are too many things to consider there to make blanket rules. Where in your code did it fail? Is this consumer idempotent? Have we consumed a more recent event that would have overwritten this event? Are you paying for some API, so that an auto-retry churning away in your DLQ is going to cost you a ton of money? Sometimes you may not even want a DLQ, you want a poison pill. That lets you assess what is happening immediately and not have to worry about replays at all.

I hope one of the books you are talking about is Designing Data-Intensive Applications, because it is really fantastic. I joke that it is frustrating that so much of what I learned over years on the data team could be written so succinctly in a book.
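
To make the "no breaking changes within a topic" rule concrete, here is a hedged Python sketch of a compatibility check a schema registry could run before accepting a new schema version. The schema shape (field name mapped to a type and a `required` flag), the three rules, and the function name `breaking_changes` are assumptions for illustration, not the tooling described in the comment.

```python
from typing import Dict, List

# Hypothetical schema shape: field name -> {"type": str, "required": bool}
Schema = Dict[str, Dict[str, object]]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List the reasons `new` would break consumers written against `old`.

    Assumed rules: existing fields may not be removed or retyped, and any
    newly added field must be optional. An empty list means "compatible".
    """
    problems: List[str] = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"field removed: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
    for name, spec in new.items():
        if name not in old and spec.get("required", False):
            problems.append(f"new required field: {name}")
    return problems
```

A registry that rejects any version with a non-empty result leaves producers with the option from the comment: publish the incompatible schema on a new topic and phase out the old one.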

gwbas1c almost 4 years ago

I've been working on a contract for 6 months where the architecture is microservices and queues.

IMO: It's over-complicated. They can ship a change to a microservice while insulating the other services from risk, but that's just kicking the can for technical debt.

What happens is that, if a service hasn't shipped in a few release cycles, when an update is made to that service, we often find latent bugs. Typically they are the kind of bugs that could be found with simple regression testing; but the company put too much effort into dividing its code into silos. (Basically, they spent a lot of time dealing with the boundaries between their microservices instead of just writing clean, testable code with decent regression suites.)

---

IMO: Don't get too hung up on microservices and events. Focus on writing simple, straightforward code that's easily testable. Make sure you have high unit test coverage and a useful regression suite. Only introduce "microservice" boundaries when there are natural divisions. (E.g., one service makes more sense to write in Node.js, another makes more sense to write in C#; or one service should run in Azure and another should run in AWS.)

This, BTW, helped immensely in a previous job. When I worked for Syncplicity, a major Dropbox competitor, we started with a monolithic C# server for most server-side logic, but we had a microservice in AWS to handle uploads and downloads. This helped immensely, because we ended up allowing customers to host their own version of the upload / download server. It was a critical differentiator for us in the marketplace.

BulgarianIdiot almost 4 years ago

There's no such architecture, much like there's no "MVC architecture" or "CQRS architecture". These are patterns that should be used specifically in time and space where and when pros outweigh cons.

Anyone calling themselves an architect, or an engineer, or even just a "good developer" would acknowledge that interaction patterns and concepts are contextual, not general idioms at the project or system level.

Speaking of them as "architectures" or embracing them, as in, doing everything the one holy way, is only a crutch for people who are confused by what it means to define your system's architecture. And a silver bullet for consultants to sell you books and training.

There is a lot of empty hype and misconceptions around EDA. For example, "it helps decouple services" is thrown around, which is nonsense to anyone who can analyze a system and knows what a dependency is (moving from "I tell you" to "you tell me" is not more decoupled, you just moved the coupling; likewise moving from "A tells B" to "everyone tells B and B tells everyone" as in event hubs is much more coupling, it plays the role of a global system variable basically).

Regarding dead letters, the most trivial answer is to log and notify the stakeholders about unconsumed messages. That's the most general approach. Think about dead letter messages the same as exceptions that bubbled to the top of the stack. And when you can handle them more specifically, you do.

monocasa almost 4 years ago

Not for a company, but I've embraced it pretty hard for my home automation. It's sort of the hammer I hit everything hard enough with until it looks like a nail by making everything go through the MQTT broker. The website? A static JSON blob describes interesting MQTT topics, and opens an MQTT over websocket connection to read/write any state. Zigbee, et al.? Translate to MQTT. Reporting? Daemon that listens to all topics and dumps it in a SQLite database to be queried at my leisure. Events like sprinklers on/off? Python scripts in cron jobs that talk to everything via MQTT.

Basically everything that makes fully event driven architectures difficult is ameliorated because the only consumers are myself and my wife, and we literally built up the whole system. Something appears to be locked up? There's a system of watchdogs to kill stuff, all hardware has been designed to fail off into manual control, and we can pick the pieces up at our leisure like when anything else in the house breaks. The last will and testament messages in MQTT are really nice for at least reporting hard failure conditions.

I'll be the first to admit that I would not look forward to productizing it and supporting someone else's house (to the point that I'll probably never do that). It's so easy for messages to make their way into the bit bucket when setting up a new subsystem, and everything is so loosely coupled because of the event system it's almost like it's all "stringly typed". And being both software engineers, we sort of relish in how awful the UI is, even using 98.css.

jetti almost 4 years ago

Where I currently work we are all in on event-driven architecture. For our DLQs, we have alerts for when the queue is growing in size or if messages are in the queue too long. When those alerts come in, we manually move the messages back to the normal queue for reprocessing, and if they get DLQed again after that we will look into the reason it is failing.

One of the benefits of this architecture for us is the ability to easily share information between services. We utilize SNS and SQS for a pub/sub architecture, so if we need to expose more information we can just publish another type of message to the topic, or if we need to consume some information then we can just listen to the relevant topic.

There are two big issues that I've run into while at this company. One is that tracking down where events are coming from can be a big pain, especially as we are replacing services but keeping message formats the same. The other big issue is that setting up lower environments (dev, qa, etc.) can be difficult because you pretty much need the entire ecosystem in order for the environment to be usable, which requires buy-in from all teams in the organization.
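
The "move it back once, investigate if it fails again" routine above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain in-memory deques stand in for the real queues (no SNS or SQS calls are made), and `Message`, `consume`, and `redrive_once` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Message:
    body: str
    redrives: int = 0  # times this message was moved back out of the DLQ


def consume(queue: Deque[Message], dlq: Deque[Message],
            handler: Callable[[str], None]) -> None:
    """Drain the main queue; anything the handler rejects goes to the DLQ."""
    while queue:
        msg = queue.popleft()
        try:
            handler(msg.body)
        except Exception:
            dlq.append(msg)


def redrive_once(queue: Deque[Message], dlq: Deque[Message]) -> List[Message]:
    """Move first-time failures back to the main queue.

    Messages that already had their second chance are returned instead,
    so a human can look into why they keep failing.
    """
    needs_investigation: List[Message] = []
    while dlq:
        msg = dlq.popleft()
        if msg.redrives == 0:
            msg.redrives += 1
            queue.append(msg)
        else:
            needs_investigation.append(msg)
    return needs_investigation
```

Tracking the redrive count on the message itself is one possible design; a real broker would more likely carry it in message attributes or a receive counter.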

zamalek almost 4 years ago

"Everything looks like a red thumb when you're holding a golden hammer."

Events are a part of a greater whole. It's a tool that you can use to solve certain data flows, but not all data flows. When you start taking more liberty with the word "eventually," you are almost certainly in a realm where event-driven makes the most sense. CQRS is a pretty good example of using many architectures (including event-driven) under a single greater architectural umbrella, and the thought patterns it introduces you to are incredibly useful. But no architecture is gospel, not even close.

Any "pure" architecture is the tail wagging the dog. The problem comes first, the solution comes second, the architecture comes third.

bob1029 almost 4 years ago

I have worked in places where event-driven architectures are a necessity (we're talking thousands of real-time systems being integrated together).

If you want to use event driven + microservices, first make sure microservices make sense. Event driven is just a cool way to tie monstrous collections of services together if you have to go down that road.

If you can build a monolith and satisfy the business objectives, you should almost certainly do it. With a monolith, entire classes of things you would want to discuss on Hacker News (such as event-driven architectures) evaporate into nothingness.

jf22 almost 4 years ago

I did at an old company.

It was great for certain use cases, bad for others. The architecture made it so it took days to do simple features like adding a sortable column.

Having to deal with that made it the worst job I've ever had. It would take 700 lines of code involving two separate systems and 70 hours to do tasks that would normally take two hours. I felt a lot of pressure because previously simple tasks would take so long.

rkangel almost 4 years ago

The closest thing I know of is the Erlang/Elixir approach to program development. The BEAM VM that they're built on is basically an instantiation of the actor programming model: a series of logical processes (services) that only communicate with each other through messages. Any state is held in an actor and you work with that state in an event driven way based on the messages you receive. I'll give a little peek at this below, but really you'd need to work with it to see how well it works at application scale.

In well architected Erlang/Elixir, most of your business logic will be written as pure functional code (which is gloriously easy to test), but then it is glued together at the boundary by GenServers (usually). GenServers are an abstraction over the BEAM primitives that makes the 'receive a message, update my state' thing very easy. The simplest handler might look like this:

<pre><code>def handle_call(:increment, _from, state) do
  {:reply, state, state + 1}
end
</code></pre>

Here state is a simple integer. When we receive the :increment message, we send back the current value and increment our local state. The way all this is wrapped up, the caller has an API that just looks like a function call which returns the value, but the underlying architecture that you're working with is all event driven.
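
For readers who don't know Elixir, here is a rough Python analogue of the same idea, offered as a sketch rather than an equivalent of a GenServer: one thread owns the state and a mailbox queue is the only way to reach it. The class `CounterActor` and its message names are invented for illustration, and none of BEAM's supervision or fault isolation is modelled.

```python
import queue
import threading


class CounterActor:
    """Owns an integer; the only way to touch it is to send a message."""

    def __init__(self, initial: int = 0) -> None:
        self._mailbox: queue.Queue = queue.Queue()
        self._thread = threading.Thread(
            target=self._loop, args=(initial,), daemon=True
        )
        self._thread.start()

    def _loop(self, state: int) -> None:
        # The state lives only in this loop, never in a shared attribute.
        while True:
            message, reply_to = self._mailbox.get()
            if message == "increment":
                # Mirrors {:reply, state, state + 1}: reply with the current
                # value, then carry on with the incremented state.
                reply_to.put(state)
                state = state + 1
            elif message == "stop":
                reply_to.put(state)
                return

    def _call(self, message: str) -> int:
        # Looks like a function call to the caller, but is a message
        # round trip: send the request, then block on a private reply queue.
        reply_to: queue.Queue = queue.Queue(maxsize=1)
        self._mailbox.put((message, reply_to))
        return reply_to.get(timeout=5)

    def increment(self) -> int:
        return self._call("increment")

    def stop(self) -> int:
        return self._call("stop")
```

As in the Elixir handler, `increment()` returns the value from before the increment, so the first call on a fresh actor returns the initial state.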

zzbzq almost 4 years ago

Events are a part of any good service-oriented architecture. They can replace patterns that involve batch-ETLing large amounts of data from system to system--events are usually a smoother way of doing the same. They're usually more resource-efficient and responsive than a poll & cache approach. They can also create a more consistent way to broadcast data, avoiding CAP problems from trying to do multiple writes to different systems, and preventing systems from devolving into anti-patterns where a system from a business domain gets misused as a message bus for another system.

Using events into a processing queue is also a good way to make systems more responsive for end-users when compared to making every operation blocking.

Events are not a good replacement for transactional request/response models (e.g., making an API call). Some people advocate for an "event sourcing" system to create its own internal domain model using events. I don't think this is a good default, but it really comes down to what tools you're using and how you're used to using them. Namely, you can't have a web service that writes to an RDBMS and then immediately writes to RabbitMQ and call it a consistent system, because the write to RabbitMQ could fail and the systems downstream would be permanently wrong. So event sourcing is used to resolve this into a single write into a queue system which then forks out into the system's own RDBMS and also other systems. However, the more "normal" way would be to just write this atomically into the RDBMS and have a second process poll it back out into the queue for downstream systems.

HALtheWise almost 4 years ago

By and large, all of industry and academia working on modern robotics systems have converged on using event-driven publish/subscribe message buses for basically everything. For example, a camera driver will produce a stream of "image" events that then trigger other code across the system, all the way until a stream of "motor command" events comes out the other side. This model is really valuable because it works so well with logging and replay workflows, and because it makes mocking and replacing different parts of the system really easy (up to and including mocking reality with a simulator). ROS is the major open source framework used in academia, and industry is split between using ROS and building proprietary internal protocols with similar functionality.

It's not an exact match for the scalable-microservices world you're thinking of (for example, typically robots don't need to deal with runtime schema version skew), but could be interesting to learn about anyway.

xet7 almost 4 years ago

Monolith is easier to handle. With microservices, any network connection could break, you need a lot more code to handle all that complexity and orchestration.

navd almost 4 years ago

Not sure if there are any communities. My general advice is to invest as much as possible in a good logging solution, traceability, and just general things to make debugging easier. Come up with a way to replay events easily. You'll thank yourself every day a bug or issue pops up.

Fiahil almost 4 years ago

I work for a big AI consultancy. Most of the time we build ETLs for the data-engineering side, in a client-driven capacity-building effort. We do this because our focus is on Data Science, not data engineering, and we often work in situations where the client doesn't have an existing data science platform. It's simpler to build, to hand over and later to maintain.

In projects where the client already has a mature engineering and data science department, we bring the big guns! The scope is usually much larger, with several workstreams, and involves production-ready deployments. In this situation we might build upon what the client already has (ETLs), or initiate a full event-driven transformation with a "backbone" team responsible for creating a platform, and several use cases building upon it. In the usual scenario, a team would want to start large computations or simulations upon receiving a trigger event from a monitoring system (model drift) or a human operator ("what would be the impact in € of a small decrease in parameter X over the next 7 days of forecasted sales?").

Event-driven systems are much more robust than traditional ETLs with a central data warehouse, but they are also much more complex to understand and operate. In the end, we rarely deploy them because they cost us way too much engineering time compared to the benefits. That's mostly because we spend >70% of our time dealing with "security teams" and "access issues". Seriously.

Licentia almost 4 years ago

Yes. It's my preferred architecture for any non-trivial system. The single biggest downside to it is that it's really hard to find people with experience building event driven systems.

There's a bit of a training curve, but it's honestly not that hard if people are willing and wanting to learn. You could level up a moderately experienced team in a matter of weeks to be able to work within a well defined event driven microservice architecture. The part that gets tricky and requires experience is carving out the boundaries and messages.

To answer the question about DLQs, I think this is a valid critique. I've seen many places just set and forget DLQs, and they might as well not have them. For me, I like to start each DLQ with an alert on every message published. Then I manually inspect the message, trace the logs and figure out what to do from there. Once I have enough data on failure modes and paths to rectify them, I can start automating DLQ processing. In general though, DLQs should not see much traffic outside of a system going down or poison messages hitting your services (broken schema changes from another service).

thire almost 4 years ago

I had experience with a product using an event based architecture at large scale, and to be honest, it was a pain to work with. For example, traceability, or troubleshooting in general, was very hard since events would spawn more events etc., making things much harder to track than expected.

Unless the scale is an issue, nowadays I always prefer a more stateful approach when possible.

gwbas1c almost 4 years ago

(Kind of related)

I was the lead developer for Syncplicity's desktop client. It was a file synchronization product very similar to Dropbox.

When I joined, the desktop client was 100% event-driven. The problem is that some kinds of operations need to be performed synchronously, so "event-driven" tended to obfuscate what needed to happen when. Translation: For your primary use cases, it's much easier to follow code that calls functions in a well-defined order, instead of figuring out who's subscribed to what. Events are great for secondary use cases.

To translate to microservices, for primary use cases, I'd have the service that generates a result call directly into the next service. For secondary use cases, I'd rely on events. Of course, there are tradeoffs everywhere, but you'll find that newcomers are able to more easily navigate your codebase when it's unambiguous what happens next.

giantg2 almost 4 years ago

"As a trivial example, everybody talks about dead-letter queues but nobody really explains how to handle messages that end up in one."

For us, it's either a function that will retry the messages after some time, or manual intervention.

Our department recently said we need to move to event driven architecture for one of our processes that currently runs in a batch. They want us to load data into EMR from an S3 bucket populated by Kinesis. Their suggested implementation is simply to run the batch job more frequently instead of once a day... sorry guys, but that's not event driven...

I suggested maybe just setting a trigger on the S3 bucket and hooking it up to Glue, since that would actually be event driven. They said 'no' because they don't want the load to EMR or Glue to run too frequently. I guess that makes sense (not that familiar with the ETL tech), but it sure doesn't make sense to call it event driven.

dm3 almost 4 years ago

A lot of great comments here.

The closest community to what you are asking for here is probably the DDD/CQRS/ES[1] Slack[2]. Google groups are pretty much dead at this point.

[1]: https://github.com/ddd-cqrs-es
[2]: https://ddd-cqrs-es.slack.com/

tomaskafka almost 4 years ago

No. In my programs I try to hold to John Carmack's advice that there is no more understandable structure than a large program that you can read from start to end. He is talking about a main game loop, but I found that this advice holds generally. Nothing beats being able to step through the function step by step and see all the variables.

lolski almost 4 years ago

I can say we're one of the companies that have successfully embraced event-driven design. We're Vaticle and we're not a microservice shop - rather, we're building database software called TypeDB. The internals are quite event-driven, mainly realised with the actor model and event-loop concurrency.

It has allowed us to scale mainly in two ways: maximising parallelism with respect to CPU, and doing other work while waiting for an RPC call to return.

Event-driven architecture by nature is more parallel and efficient, but comes with a weaker consistency guarantee when it comes to the ordering of events coming from multiple parallel sources.

In my experience, people tend to fall prey to these pitfalls, and end up resorting to inappropriate workarounds such as global locks and ad-hoc retry mechanisms. These are most commonly done when trying to aggregate work coming from concurrent producers or when needing to handle communication failures.

In fact, communication failures and downtimes are the most prominent problem in microservices, particularly when you need your data to be inserted into multiple data sources in an atomic way.

This is an inherent issue in distributed systems, and you have to think about what the atomic unit of data is that you wish to insert, and design your system based on this hard constraint. Making the operations atomic, idempotent or revertible are some of the solutions you may want to investigate, but the moral of the story is that you need to make sure these additional complexities are justified.

For us as a company, we decided on the event-driven architecture after knowing not just the benefit, but also the cost that I've outlined above.

For simpler applications that don't need to be a) real-time and b) able to handle crazy amounts of load (think small internal applications, small business ecommerce websites), I would resort back to a good old non-event-driven system since it's the more pragmatic option.

I've seen several companies building an event-driven architecture even when they know there's no way they would need to scale beyond serving several thousands of requests per hour in the next two years. I think they would've been better off with a simpler, synchronous model.
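
Of the options listed above, idempotency is the easiest to show in a few lines. The Python sketch below deduplicates by event ID so that a retried or redelivered event has no further effect. It assumes every event carries a unique `event_id`, and the `IdempotentLedger` class is hypothetical; it is not how TypeDB works internally.

```python
from typing import Dict, Set


class IdempotentLedger:
    """Applies each event at most once, keyed by its event_id."""

    def __init__(self) -> None:
        self.balances: Dict[str, int] = {}
        self._seen: Set[str] = set()

    def apply(self, event: dict) -> bool:
        """Apply the event; return False if it was a duplicate and was skipped."""
        if event["event_id"] in self._seen:
            return False  # retry or redelivery: safe to ignore
        account = event["account"]
        self.balances[account] = self.balances.get(account, 0) + event["amount"]
        self._seen.add(event["event_id"])
        return True
```

In a real service the set of seen IDs and the balances would have to be persisted together in one atomic write, otherwise a crash between the two steps brings back the problem this is meant to solve.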

microDude almost 4 years ago

I work in semiconductor manufacturing, it's a very common model. I have been in about eight fabs around the world that use it quite successfully.

monksy almost 4 years ago

I have at a few companies now.

It's great for processing data that goes beyond a single database call, data formatting and presenting something on a page.

If you're chaining multiple microservices together you've made a very sloppy/poor man's version of this. (People tend not to account for downed services, maintenance updates, client durability, etc.) When you bring in technologies like Kafka to orchestrate this, you'll end up with a more reliable system that you can fix if something goes wrong. This mostly changes the way you think about data and how you present it. Also, it'll increase your uptime because your service's SLA is isolated from what you're processing. (Your service and the persistent storage that is storing state is what people see... the data being out of date is something you should account for.)

Schema changes: Generally that's not that big of a deal because multiple applications get started at once. You should have system level tests to catch that before you go out. Also, application smoke tests help as well. As long as you picked a durable message queue with a framework that'll crash on error, you can fix that, bring up the fix and continue processing through.

Dead letter queues: It's about how you architect more than anything. This is something you should plan for.

TuringNYC almost 4 years ago

I saw a number of investment houses (mostly sell-side) do this. This was in the age of ESB (Enterprise Service Bus).

The architecture made sense since events (new trades or quotes) dictate a host of downstream activities, which often need to be near-real-time to reduce divergence.

marto1 almost 4 years ago

It's kinda like using Lisp. It may be great, but it's harder to find friends :-)

soperj almost 4 years ago

Event driven architecture honestly seems like a different flavour of all the things we hate with spaghetti code gotos.

wly_cdgr almost 4 years ago

Well, every game company :)

Makes sense robotics would do it too. AAA games and robotics are basically the same field after all.

jon-wood almost 4 years ago

We definitely haven't fully embraced an event driven architecture, but we have gone all in on it where it makes sense for us. We're processing on the order of a billion messages a day from hardware in customers' homes, which is probably the ideal use case for event based comms. Handling devices being offline for a while becomes much simpler when the response to that can just be queuing up events and transmitting when available.

One of the key lessons we learned was that your event ingest needs to be rock solid. Put up a service, and then make it do one thing only: receive an event and throw it onto the message bus. We do this for messages from devices, but also for 3rd party services which send us webhook notifications. If ingest fails so does everything else; don't let that happen.

The other thing I'd say is to make sure there's a central source of truth for what services are consuming which messages. We made the mistake early on of saying services should be responsible for setting up their own subscriptions, and it's made it much more difficult to answer questions around who's going to be impacted by changes or outages. At a minimum have a wiki page on it. Ideally manage subscriptions in Terraform or similar.

Finally, DLQs. Typically we don't do much with them. We have logging of messages that get pushed to the DLQ, and usually it's a non-recoverable error, often around validation or accounts being disabled. They are handy in the case of an outage though, as you can just push all the messages back into the queue when things recover.

bullen almost 4 years ago

Yes, I made my own open-source event driven platform: http://github.com/tinspin (rupy is the foundation and fuse is an example implementation tested with 350.000 users and 5 years uptime)

The learnings were 2-fold:

1) You need async-to-async capable db clients so that you use 4 threads (potentially on separate cores) for each browser <-> server <-> database roundtrip.

Since most databases don't have async capable clients I wrote my own database too: http://root.rupy.se

2) You should use a VM + GC language so that you can use atomic shared memory between cores, that way any core can handle any request efficiently (and access other users' memory).

This part is very hard to prove in theory, but in practice I'm baffled by how well Java performs. You can find three quotes that I managed to find here: https://github.com/tinspin/rupy/wiki

Finally, getting threads to cooperate on things is hard and you cannot debug it with any tools; instead you have to use "trial and error" until it sort of works all the time.

swader999 almost 4 years ago

Operational support is more interesting with this kind of an architecture. Dealing with message queues and all that can be challenging for a traditional organization.

8note almost 4 years ago

Some teams parallel to mine have an event based contract with their upstream, vs my team has a service contract.

We've been doing some refactors to combine common systems, so now the same team is upstream for both.

Talking to the sister teams, they're pretty unhappy about their relationship with the upstream and are trying to avoid and replace them internally, vs we quite enjoy them.

I think the big difference is in mental model. When you're passed an event stream, the producer doesn't care about the events going out, and it's on you to handle all of them, and for failures, you have to reach out to whoever made failures in the upstream system, rather than the upstream team doing it. OTOH, for a service call, you only need to throw an exception for a bad event, and the upstream team is responsible for communicating the failure.

The more event based interfaces you have between your team and the folks making the change, the harder it gets to tell them that they're doing something wrong, and the less you even know about what they're doing or how to find them.

Mind you, immediately after the service call, we put messages on a queue. Distributing the message between systems we own works just fine.

polskibus almost 4 years ago

I'm guessing the OP meant event driven with persistent events. If the events exist only at runtime then a lot of the burden related to schema evolution disappears.

We tried this approach a couple of times with mixed results. In one case, for a new component with strict DDD modeling, it was a boon to productivity. In others, preexisting ones, we never got to realize the gains; we invested quite a bit and I'm not sure if we'll ever get the upside.

phaedrusalmost 4 years ago

I maintain a legacy application which was written in a fully event-driven way. As another commenter mentioned, native Windows programs work this way, but it is not just because this is a Win32 program. The original author(s) of this application also embraced a multi-threaded paradigm and use their own homegrown asynchronous serial message and event system (built on top of Windows messages).

It's terrible. The reason it's terrible is that you can't use function call graphs, single stepping, or call stacks to debug this application. Everything happens indirectly by one part of the application throwing a message in a bottle into the ocean, and another part of the application (running on another thread) finding the bottle at a later time. And every message is written in a different binary format (different memcpy'd structs) which is mutually intelligible to each sender-receiver pair and no one else.

Troubleshooting and understanding this system is more akin to endocrinology or ecology than math or engineering.

EamonnMRalmost 4 years ago

Much of our codebase is python microservices communicating via Kafka. Once you get past the hurdle of getting kafka connected it's pretty reliable. We have a shared library for producing and consuming so we don't need to reinvent the wheel for new services. We also dump the resulting messages as rows into a database. It works very well.

FpUseralmost 4 years ago

> "fully embraced"

"Fully embraces" / commits: not a very wise thing to do. Event-driven arch is one of many tools at your disposal. It is awesome for some things and not so much for others. You can't use a single tool / approach for everything and expect the best results.

restersalmost 4 years ago

There are many examples of a "hello world" for event-driven architecture, but there doesn't seem to be the equivalent of the "sinatra/express" of event-driven architecture: a minimalistic foundation to build a customized platform using the approach. Things like a schema migration layer would fit nicely onto something like this.

Most event-driven systems are big company projects with a lot of legacy requirements and integration complexity, or they are narrowly tailored and hard to generalize.

I think that a simple template event-driven system that includes a small number of libraries and does something simple but interesting would be a big help.

I've been wanting to create an open source event-driven e-commerce system but haven't had the time.

tyingqalmost 4 years ago

"everybody talks about dead-letter queues but nobody really explains how to handle messages that end up in one"

It varies, but one common example is that the queue is worked by humans. This is pretty common, for example, in travel, for things like accidental overbookings.
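To make the "worked by humans" idea concrete, here is a minimal in-memory sketch, not tied to any real broker. The names `Message`, `process_all`, and `book_seat` are made up for illustration, and the retry limit of 3 is an arbitrary assumption. A message is retried a few times; if it keeps failing it is parked together with the error history, so the person triaging it later can see why it failed.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Message:
    body: dict
    attempts: int = 0
    errors: list = field(default_factory=list)


def process_all(queue, handler, max_attempts=3):
    """Drain `queue`; messages that keep failing are parked in a dead-letter list."""
    dead_letters = []
    while queue:
        msg = queue.popleft()
        try:
            handler(msg.body)
        except Exception as exc:
            # Record why it failed so a human can triage it later.
            msg.attempts += 1
            msg.errors.append(repr(exc))
            if msg.attempts >= max_attempts:
                dead_letters.append(msg)
            else:
                queue.append(msg)  # retry later, behind the other messages
    return dead_letters


def book_seat(body):
    # Hypothetical handler: rejects bookings for a flight that is already full.
    if body["flight"] == "XY1":
        raise ValueError("flight XY1 is overbooked")


queue = deque([Message({"flight": "AB7"}), Message({"flight": "XY1"})])
dead = process_all(queue, book_seat)
for msg in dead:
    print(msg.body, msg.attempts, msg.errors[-1])
```

In a real system the dead-letter list would be a durable queue or table with a small UI on top; the point of the sketch is only that the parked message carries enough context for a person to decide whether to fix and replay it or to discard it.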

thdxralmost 4 years ago

The overhead with this architecture can be cumbersome. Which is why most successful deployments of it tend to be with teams that have embraced full stack serverless. Recommend exploring that community, plenty of event driven systems there.

xwolfialmost 4 years ago

Yes - I work in an investment bank, and we try to do millisecond-level latencies for our order management system that sends client orders to the various exchanges (for sub-milli we use FPGA but it's very expensive and only for some clients).

It works alright (like you I hated it all before, coming from more amateurish implementations). It's slow to change (adding a new event type can take years before it works everywhere), and failures can happen: for instance, if the message parsing library has a crushing bug and an extendable attribute zone has a poison pill that never appeared before, there's nothing you can do but manually edit the event source.

What it brings us, I suppose, is that every microservice is single-threaded and all events are well recorded. We use multicast to transfer them from one sequencer to all consumers, so we just need good routers and TCP-level message building; it's very bare-bones to keep it fast. Extensibility for us is mainly about adding more services around a core stream that we don't really need to change all that much. We do a lot of regulatory validations, data analysis, and the odd scale-out for a round-robin compatible process. Not all of them are; some need to see all events, for instance for cumulative exposure calculation (client stock exposure on several markets for risk-based decisions).

It also avoids latencies like in state-request based systems, since each service will build its own state machine. We make a lot of money on this system, so we can hire hundreds of people around the world to maintain it.

At this point I don't see how to do it better (5ms round trip to the client if low validation, 100ms if crossing seawater to an exchange with short selling validations) without events, but I know very well that doing it for simpler flows that are not latency-constrained will probably result in heavy cost and low gain. I would never recommend it for people who aren't already struggling with a fully-fledged business implementation that makes money they want to accelerate.

The problems we face are:

- it's slow to evolve the very core of the system
- we need perfect ordering, so there will always be time wasted at the sequencer to transform unordered "commands" for the various services into perfectly ordered events
- testing and debugging is an art that takes time to acquire: I can do it now, but the task seemed daunting when I started. You need to know how to spin up the minimal surface of services to make a valuable replay, and how to make static configuration reproduce production's behavior exactly, so that all services behave the way you want to reproduce if you're investigating a sequence-based issue (rather than a function-based issue)
- it takes up to a year for a new Java dev to get productive with such an exotic mindset, but that's also because we do no intraday malloc, since we cannot afford a GC in the middle of a client order
- management cannot understand why they can't cut cost using the cloud, virtual machines, vendor databases etc. Even in a company that has made billions over 20 years on this system, we still can't explain it in a way that sounds valuable vs its cost. Because its cost probably is extremely high, and it can't be outsourced by hit-and-run consulting managers without being brought in-house again.

So we're not like the most popular devs, we're the slow and expensive ones :(
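A toy sketch of the shape described above (one sequencer, single-threaded consumers that each build their own state from the ordered stream) might look like the following. This is an illustration under assumptions, not the bank's system: `Sequencer`, `ExposureService`, and the event fields are invented names, and a Python list stands in for multicast.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Event:
    seq: int
    client: str
    market: str
    qty: int


class Sequencer:
    """Turns unordered commands into a single, gap-free ordered event log."""

    def __init__(self):
        self._next = count(1)
        self.log = []

    def submit(self, client, market, qty):
        event = Event(next(self._next), client, market, qty)
        self.log.append(event)
        return event


class ExposureService:
    """Single-threaded consumer that keeps cumulative exposure per client."""

    def __init__(self):
        self.last_seq = 0
        self.exposure = {}

    def apply(self, event):
        # Refuse to continue if an event was missed: state would be wrong.
        if event.seq != self.last_seq + 1:
            raise RuntimeError(f"gap: expected {self.last_seq + 1}, got {event.seq}")
        self.exposure[event.client] = self.exposure.get(event.client, 0) + event.qty
        self.last_seq = event.seq


sequencer = Sequencer()
sequencer.submit("acme", "NYSE", 100)
sequencer.submit("acme", "LSE", 50)
sequencer.submit("zeta", "NYSE", -20)

live = ExposureService()
for e in sequencer.log:
    live.apply(e)

# Debugging by replay: a fresh instance fed the same log ends in the same state.
replayed = ExposureService()
for e in sequencer.log:
    replayed.apply(e)

print(live.exposure)
print(replayed.exposure == live.exposure)
```

Because each consumer is deterministic and sees every event in the same order, replaying the recorded log into a fresh instance reproduces its state, which is what makes the "spin up a minimal set of services and replay" style of debugging possible.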

the__alchemistalmost 4 years ago

My most recent (embedded) programs are almost-entirely interrupt driven, with a main loop containing only a wait-for-interrupt. So, all actions are driven by a GPIO event, RTC or timer wakeup, USB request etc.

blablabla123almost 4 years ago

I've worked on a project that was fully event-driven realized as Microservices. (I think a few of the externally connected Microservices were event-driven as well but not all.) That was all roll-your-own without framework. So things could break. But the philosophy was more like: everything is written in a very lightweight and simple way. So if it breaks, it can be fixed swiftly. I've also seen a similar approach at another place. FWIW both places had unusually high availability requirements. ("Fail fast"...)

nitwit005almost 4 years ago

Used it for fully transient services. If search died, it could be rebuilt from scratch. If chat died, data about active chats would be lost, but no one really cared.

The main issue with using it with other services was answering questions like "how do we restore a database backup?" or "what do we do if Kafka blows up, and we lose messages?". It's possible to create a design that won't get into an inconsistent state when bad things happen, but people tend to greatly underestimate how hard it is.

MichaelMoser123almost 4 years ago

nginx is fully event driven: <a href="https://github.com/nginx/nginx" rel="nofollow">https://github.com/nginx/nginx</a>

Here is a list of what nginx is using for event handling on the different operating systems: <a href="https://nginx.org/en/docs/events.html" rel="nofollow">https://nginx.org/en/docs/events.html</a>

I don't know of any big project using the newer io_uring. Does anyone know some big examples of io_uring usage?

_3u10almost 4 years ago

Yes, most MVC platforms are event driven.

Since you mentioned schema breakage, what I imagine you're doing is the inner-platform effect, as your API / database already supports everything you are trying to reinvent. Just use PostgREST and views.

When you break a view you know you're making incompatible changes. Stop right there and either version the view, or figure out how to add your feature without breaking the view.

It's pretty easy to avoid making breaking API changes.
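A small sketch of "version the view", using SQLite from the Python standard library as a stand-in for Postgres (an assumption made purely so the example runs anywhere; the table and view names are invented). The old view keeps serving the old contract while a new one exposes the changed shape.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    INSERT INTO users VALUES (1, 'Ada', 'Lovelace');

    -- v1 contract: consumers expect a single name column.
    CREATE VIEW users_v1 AS
        SELECT id, first_name || ' ' || last_name AS name FROM users;

    -- v2 contract: the name is split in two. v1 stays in place for old consumers.
    CREATE VIEW users_v2 AS
        SELECT id, first_name, last_name FROM users;
""")

v1_rows = db.execute("SELECT * FROM users_v1").fetchall()
v2_rows = db.execute("SELECT * FROM users_v2").fetchall()
print(v1_rows)
print(v2_rows)
```

Consumers of `users_v1` never notice that the underlying table changed; only those that opt in to `users_v2` see the new columns.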

valzamalmost 4 years ago

I recently got recommended this video. I haven't watched it yet, but given that the speaker is Kleppmann I suspect it will be very helpful.

Thinking in Events: From Databases to Distributed Collaboration Software

<a href="https://www.youtube.com/watch?v=ePHpAPacOdI&list=WL&index=1&t=38s" rel="nofollow">https://www.youtube.com/watch?v=ePHpAPacOdI&list=WL&index=1&...</a>

zarkov99almost 4 years ago

In finance, specifically in trading systems, event-driven is a natural fit and in my experience the default to which systems converge.

sae3023almost 4 years ago

I've worked at a company that has launched at least one product where the back-end was entirely event-sourced.

Apart from using event sourcing and CQRS, they take DDD very seriously.

They use a self-made open source framework, which has very good JavaDocs. <a href="https://github.com/SpineEventEngine/" rel="nofollow">https://github.com/SpineEventEngine/</a>

purpleideaalmost 4 years ago

Yes, have a look at <a href="https://github.com/purpleidea/mgmt/" rel="nofollow">https://github.com/purpleidea/mgmt/</a> Not at a 1.0 release yet, but there's enough for you to have fun with. LMK

arodygincalmost 4 years ago

I think the best model that describes the event-driven approach is Petri nets. The theory is quite simple, yet powerful.

Though it is difficult to implement in a straightforward way, it is able to take your mind in the right direction.

pbreitalmost 4 years ago

Are there any rules of thumb where such an architecture should be considered? >X TPS? >Y milliseconds per txn? >Z milliseconds between write and subsequent read? Eventual consistency OK?

cturneralmost 4 years ago

Idea: have event streams between microservices always be bilateral (two-party) contracts. When you want to make a change, you can pick up the phone to the other end and get it done quickly.

Multilateral contracts easily lead to email chains or meetings over each change, and compromise data structures, similar to the "add a column on the end" culture often used in shared reldbs.

What is your motivation for wanting microservices? As an alternative, what about events between the processes of a single codebase? In that case, when you want a schema change: change it, run your tests to see that all modules comply, redeploy everything.

zarathustrealalmost 4 years ago

If you think about it, modern front end web development (with React + Redux) is event-driven! In a way, there are tons of people embracing it

Animatsalmost 4 years ago

Window systems are classically event-driven. Especially earlier single-thread ones from Microsoft.

Ericson2314almost 4 years ago

them kids and their microservices...

kwdcalmost 4 years ago

Got a list of books for reference?

austincheneyalmost 4 years ago

A fully event-driven, service-based application I wrote that matches file system interaction to peer-to-peer networking: <a href="https://github.com/prettydiff/share-file-systems" rel="nofollow">https://github.com/prettydiff/share-file-systems</a>

nitrixalmost 4 years ago

You mean Erlang?

Graffuralmost 4 years ago

Would be very interested to know!

sidcoolalmost 4 years ago

Yes. I am interested in knowing what was done before event driven architecture to make robust asynchronous systems?

bob_robotoalmost 4 years ago

It depends on your definition of fully embraced. If you mean that there is no synchronous communication between services, then no, and neither does it make sense in the real-world scenarios I am aware of.

However, I am an advocate of the pattern and have seen it used successfully repeatedly. The largest scale was as the data lead for a product maintained by 100-200 developers and handling several thousand transactions per second.

To answer your specific questions:

> handling breaking schema changes or failures in an elegant way, and keeping engineers and other data consumers happy enough?

We did not allow for breaking schema changes. If there is a breaking change, it's a new event/topic. We used Kafka and every topic needed to have a compatibility scheme defined (see <a href="https://docs.confluent.io/platform/current/schema-registry/avro.html" rel="nofollow">https://docs.confluent.io/platform/current/schema-registry/a...</a>) to clarify what constitutes a breaking change. Even though some claim that producers and consumers can be fully decoupled, you will need to have a good idea who your consumers are and the time horizon of the data they consume. Application engineers are usually easier to keep happy than machine learning practitioners and other data consumers that want to consume events emitted over a long time period, potentially years.

> As a trivial example, everybody talks about dead-letter queues but nobody really explains how to handle messages that end up in one.

Dead letter queues are a tool you can use when the context demands it; applying them wholesale likely creates too much overhead. But to provide you with a specific example: some emitted events will be revenue impacting and, depending on your setup, you actually want to use the events for financial reporting (careful! some more info later). In this specific use case, if you can't process a record, the last thing you want to do is throw the message away. Somebody will need to have a look at these records, fix the cause and then either re-emit the records based on what you know about them from the header or fix the records in the DLQ. So think about the guarantees you need to provide and decide whether a DLQ makes sense for your use case.

Some other thoughts and considerations:

- Topics more or less directly become analytics tables. This almost creates a unified view on your application's data that is otherwise difficult to create.
- How are the messages emitted? Are the messages emitted from the application logic? If so, what guarantees do you need? What happens if the app crashes (e.g. after a DB transaction happens and before the event was emitted)? Depending on what you need, have a look at the transaction outbox pattern.
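For readers who haven't met the transaction outbox pattern mentioned above, here is a minimal sketch using SQLite from the Python standard library. The table names, the `place_order` and `relay` functions, and the `publish` callback are all hypothetical; in practice the relay would publish to something like Kafka. The idea is that the business row and the event row are written in the same database transaction, and a separate relay publishes the event afterwards.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT, payload TEXT, sent INTEGER DEFAULT 0
    );
""")


def place_order(order_id, amount):
    # The business row and the event row commit together, or not at all.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order_placed", json.dumps({"id": order_id, "amount": amount})),
        )


def relay(publish):
    """Publish unsent outbox rows in order; mark each sent only after publish succeeds."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))


published = []
place_order(1, 250)
relay(lambda topic, event: published.append((topic, event)))
relay(lambda topic, event: published.append((topic, event)))  # nothing left to send
print(published)
```

If the process crashes between `publish` and the `UPDATE`, the row is published again on the next run, so this gives at-least-once delivery and consumers need to tolerate duplicates.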

jerfalmost 4 years ago

Here is an alternate point of view: Everything is event-based. Fundamentally, our universe is event based; things interact via events mediated by the "force-carrying particles". It's down there at the bottom. Even up here at massively higher levels, everything is fundamentally event-based.

If that's the case, then why don't we write entirely in terms of events as our base architecture? It isn't because we are ignorant of event-based processing, it is because we want the other types of systems in our lives. Transactions don't really exist; they are an abstraction we add to certain elements of the world that are behaviors they perform in response to certain types of incoming events. API calls don't really exist; they are a stereotyped pattern of an event making a call and an event sent back with a response, tied to the first via some ID (which may be a TCP socket), with no further activity on that ID for that event. That is an abstraction we added on top of events for a certain very common type of call.

Working directly with event-based systems is the software architectural equivalent of writing in assembler. Sometimes you have to do it because nothing else will do. However, you are dropping to a lower level, with all that implies, particularly the fact that you are now responsible for any of those nice properties that you want to enforce. Very similar to how if you want to use UDP, but you want some of the guarantees of TCP, you are now responsible for those guarantees. There's nothing wrong with that. It's just something you need to be aware of in making your decisions.

Being the foundational architecture everything else is based on, event-based systems can do anything that is possible to do, again, quite similar to how assembler is what can do anything the CPU can do. The other abstractions get their power by placing limits on the event flows in the system, again just as a higher-level language like C, or perhaps even more clearly Rust, simply can not be used to generate all possible assembler instruction sequences, because the way they fundamentally work is to exclude sequences from the set of all possible sequences to only contain sequences maintaining certain properties. Event-based systems can implement API-call-type sequences internally. Event-based systems can implement transactions by manually implementing all the requisite limitations of events and event ordering and what things do in response to events. Etc.

But, if you have a system that needs any of those guarantees, it's kind of silly to start writing at an event-based level, only to have to painfully reconstruct the guarantees already available to you. As the Ancient Wisdom goes, "Any sufficiently complicated event-based system contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of TCP." Most of these abstractions are around for a reason.

Where things go wrong is when the abstractions become detached from their underlying implementation in people's minds, and in their minds, become the base abstraction. Probably the single biggest instance in this space of that problem is treating "the API call" as "the fundamental abstraction". API calls are an incredibly useful abstraction, but, at the same time, a really terrible primitive to be the bottom of your system. To build the API abstraction out of event flows involves throwing away a lot of the capabilities of events. If an API call is what you need, and it is a very common need, that's a virtuous simplification, but when your needs exceed what an API call can do, you can really wreck a design trying to implement an event system back on top of API calls. I've seen it at least twice in my career; one of the products I'm responsible for can almost literally be seen as a rewrite of a previous version that made that mistake in a manner that basically killed it architecturally, replacing it with an event-based system at the core... on top of which I then immediately implemented an API call layer, which mostly ran the system... but... right at the critical place... didn't, and I could reach back down the stack and use the raw event-based system in the core for a few critical bits of functionality.

This is my "alternate point of view". Everything is already event-based, even when you can't see it. However, that doesn't mean it's a good idea to work at that level of abstraction all the time, any more than the fact that CPUs run assembler means we should always be working in raw assembly code. It is not necessary to use raw events everywhere. It is not a betrayal of good design to have some API calls in your system, or a centralized transactional database, or even TCP (which most notably adds "ordering" to events). It is, however, necessary for software architects to understand that the event-based system underneath is fundamental, and that they view the other additional abstractions as islands of functionality based on the event-based core of the world underneath, and not confuse those islands with the bedrock. Many systems may not even expose event-based functionality at a raw level anywhere, but if you keep this principle in mind, if a raw-event use case ever pops up, your system will quite likely be ready to handle it.
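The claim that an API call is a stereotyped pattern of two events tied together by an ID can be sketched in a few lines. This is a toy under assumptions: `EventBus`, `ApiClient`, and `doubling_service` are invented for illustration, and the "bus" is an in-process list of callbacks rather than a network.

```python
import asyncio
import itertools


class EventBus:
    """Toy in-process bus: every published event goes to every subscriber."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, event):
        for callback in list(self._subscribers):
            callback(event)


class ApiClient:
    """Builds 'call and wait for the reply' out of raw events plus an ID."""

    def __init__(self, bus):
        self._bus = bus
        self._ids = itertools.count(1)
        self._pending = {}
        bus.subscribe(self._on_event)

    def _on_event(self, event):
        if event["type"] != "response":
            return
        future = self._pending.pop(event["id"], None)
        if future is not None:  # unknown or duplicate replies are ignored
            future.set_result(event["value"])

    async def call(self, payload, timeout=1.0):
        call_id = next(self._ids)
        future = asyncio.get_running_loop().create_future()
        self._pending[call_id] = future
        self._bus.publish({"type": "request", "id": call_id, "payload": payload})
        return await asyncio.wait_for(future, timeout)


def doubling_service(bus):
    # Hypothetical service: answers each request with twice its payload.
    def on_event(event):
        if event["type"] == "request":
            bus.publish(
                {"type": "response", "id": event["id"], "value": event["payload"] * 2}
            )

    bus.subscribe(on_event)


async def main():
    bus = EventBus()
    doubling_service(bus)
    client = ApiClient(bus)
    return await client.call(21)


result = asyncio.run(main())
print(result)
```

Everything the "call" abstraction promises (exactly one reply, matched to its request, within a time limit) is bookkeeping the client adds on top of events, and it is also what gets thrown away: a second response with the same ID is simply dropped.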

swyxalmost 4 years ago

You seem to have conflated two things here: do event-driven architectures work at all, and has anyone "fully embraced" it. It doesn't have to be "fully embraced" to be successful.

I started going down this rabbit hole a year ago (see the many good replies to this <a href="https://twitter.com/swyx/status/1241482183472295939?s=20" rel="nofollow">https://twitter.com/swyx/status/1241482183472295939?s=20</a>) and most people feel it is "hard to reason about", which often seems code for unfamiliar.

What I and other people were lacking is a good framework to think about it. Unfortunately this has compromised my credibility to you, as I left Amazon to go work on this very problem at <a href="https://temporal.io" rel="nofollow">https://temporal.io</a> this year. I'll try to give some thoughts on how we tackle this but wanted to give that disclaimer upfront. I'm not trying to sell you anything other than "I think this architecture could work, use whatever you want".

1. DLQs - the AWS answer would be to wire up Lambda and SQS to build your own DLQ retry system (<a href="https://aws.amazon.com/blogs/compute/using-amazon-sqs-dead-letter-queues-to-replay-messages/" rel="nofollow">https://aws.amazon.com/blogs/compute/using-amazon-sqs-dead-l...</a>). This is a bunch of extra provisioning and coding. So you may want to use the retries built into Step Functions (<a href="https://aws.amazon.com/blogs/developer/handling-errors-retries-and-adding-alerting-to-step-function-state-machine-executions/" rel="nofollow">https://aws.amazon.com/blogs/developer/handling-errors-retri...</a>). But instead of learning a bespoke States Language and debugging-by-redeploying-cloudformation (sooo slow lol), you may wish to work in a proper programming language SDK you can run and test locally instead (this is Temporal.io's approach).

2. Failures - any decent workflow engine will log and retry your failures for you. I wouldn't write my own logic for that these days.

3. Microservice communication - what problems do you foresee? Need more here. We simply call them Signals (send data in) and Queries (get data out) and it works well.

4. Breaking schema changes (versioning/migration) - yes, this is really fragile unless you have a proper framework to bring this all together. We just build versioning into our SDKs and give you replay tools to verify you've handled still-running workflows (<a href="https://www.youtube.com/watch?v=kkP899WxgzY" rel="nofollow">https://www.youtube.com/watch?v=kkP899WxgzY</a>).

5. Keeping engineers happy - this one REALLY depends on what you're talking about, but being able to write tests for your asynchronous/distributed system is important for increasing confidence, as is being able to work in your preferred language (polyglot microservices), making every part of the system horizontally scalable so you don't have random bottlenecks, having everything logged and persisted so you are resistant to network/machine failures and can figure out exactly what went wrong when it goes wrong... I could go on.

Of course I'd love for more neutral users of workflow engines to chime in if I got anything wrong here. Just trying to offer what I've learned so far working in this area.

ProfHewittalmost 4 years ago

"Event-driven architecture" only considers a small subset of the computational events fundamental to understanding computation.

Actor Theory is based on automatizing the "precedes" partial order for all computational events.

Proving properties of computational systems can be accomplished using Actors Event Induction for computational events.

For more information see the following video: <a href="https://www.youtube.com/watch?v=AJP1VL7shiI" rel="nofollow">https://www.youtube.com/watch?v=AJP1VL7shiI</a>

igituralmost 4 years ago

> Every event-driven architectural pattern I've read about can quite easily fall apart and I have yet to find satisfying answers on what to do when things go south.

Just a kind request from someone living in the southern hemisphere not to use "south" as a synonym for "bad" or "fail".
