Great breakdown of the "architectural decision log" for the evolution of this system.<p>> This model broke down when we added backpressure and resilience patterns to our agents. We faced new challenges: what happens when the third of five LLM calls fails during an agent’s decision process? Should we retry everything? Save partial results and retry just the failed call? When do we give up and error out?<p>> We first looked at ETL tools like Apache Airflow. While great for data engineering, Airflow’s focus on stateless, scheduled tasks wasn’t a good fit for our agents’ stateful, event-driven operations.<p>> I’d heard great things about Temporal from my previous teams at DigitalOcean. It’s built for long-running, stateful workflows, offering the durability and resilience we needed out of the box.<p>I would also have reached for a workflow engine here. But I wonder if actor frameworks might actually be the sweet spot; something like Erlang's distributed actor model could be a good fit. I'm not familiar with a good distributed actor framework for Python, but there's of course Elixir, Actix, and Akka in other stacks.<p>Coming from the other direction, I'm not surprised that Airflow isn't a fit for this purpose, but I wonder if one of the newer generation of ETL engines like Dagster would work? Maybe the workflow here just involves too many pipelines (one per customer per agent, I suppose) and too many sensor events (each Slack message would get materialized; not sure if that's excessive). It could be a fairly substantial overhaul to the architecture compared with adopting Temporal, but I'd be interested to know if anyone has experimented with this option for AI workflows.
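<p>To make the quoted question concrete ("save partial results and retry just the failed call?"), here is a minimal sketch in plain Python. It is not Temporal's API or the article's implementation; `run_with_checkpoints`, `StepFailed`, and the `checkpoint` dict are hypothetical names, and the in-memory dict stands in for whatever durable store a workflow engine would provide.

```python
from typing import Callable, Dict, List, Tuple


class StepFailed(Exception):
    """Raised when a step still fails after its retry budget is spent."""


def run_with_checkpoints(
    steps: List[Tuple[str, Callable[[], str]]],
    checkpoint: Dict[str, str],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Run named steps in order, skipping any already recorded in `checkpoint`.

    Assumption: `checkpoint` is an in-memory dict here; a real system would
    persist it so a crashed process could resume from the same point.
    """
    for name, call in steps:
        if name in checkpoint:
            continue  # completed in an earlier run, so do not repeat it
        for attempt in range(1, max_attempts + 1):
            try:
                checkpoint[name] = call()  # save the partial result
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this step; earlier results stay in `checkpoint`.
                    raise StepFailed(
                        f"{name} failed after {attempt} attempts"
                    ) from exc
    return checkpoint
```

<p>If the third of five calls exhausts its retries, the first two results remain in `checkpoint`, so a later call with the same dict resumes at the third step instead of redoing everything. A production version would likely add backoff between attempts and distinguish retryable from fatal errors.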