Hello HN, we're Gabe and Alexander from Hatchet (<a href="https://hatchet.run">https://hatchet.run</a>). We're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.<p>Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, where data loss can occur under OOM conditions if you're not careful; using PG helps avoid that entire class of problems.<p>We also wanted something that was significantly easier to use and debug for application developers. A lot of the time, the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks based on exported prom metrics). We're building this type of observability directly into Hatchet.<p>What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters and regions - they are remotely invoked via a long-lived gRPC connection with the Hatchet queue. We've attempted to optimize our latency to get task start times down to 25-50ms, and much more optimization is on the roadmap.<p>We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, and dependent tasks. A few things we're currently working on: we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but we didn't want to spend additional time on the exchange logic until we'd built a stable underlying queue. We are also considering the use of NATS for engine-engine and engine-worker connections.<p>We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.
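For anyone who hasn't seen the SKIP LOCKED pattern in practice, here's a minimal sketch of how a Postgres-backed dequeue typically works. This is just the general pattern, not Hatchet's actual schema; the "tasks" table and its columns are invented for illustration.

    # Minimal sketch of the FOR UPDATE SKIP LOCKED dequeue pattern.
    # Not Hatchet's schema -- table and columns are invented for the example.
    import psycopg2

    conn = psycopg2.connect("dbname=app")

    def dequeue_one():
        with conn:  # one transaction per claim; commits (or rolls back) on exit
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT id, payload
                      FROM tasks
                     WHERE status = 'queued'
                     ORDER BY created_at
                       FOR UPDATE SKIP LOCKED
                     LIMIT 1
                """)
                row = cur.fetchone()
                if row is None:
                    return None  # nothing available that another worker hasn't already locked
                cur.execute("UPDATE tasks SET status = 'running' WHERE id = %s", (row[0],))
                return row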
I love your vision and am excited to see the execution! I've been looking for <i>exactly</i> this product (postgres-backed task queue with workers in multiple languages and decent built-in observability) for like... 3 years. Every 6 months I'll check in and see if someone has built it yet, evaluate the alternatives, and come away disappointed.<p>One important feature request that probably would block our adoption: one reason why I prefer a postgres-backed queue over e.g. Redis is just to simplify our infra by having fewer servers and technologies in the stack. Adding in RabbitMQ is definitely an extra dependency I'd really like to avoid.<p>(Currently we've settled on graphile-worker which is fine for what it does, but leaves a lot of boxes unchecked.)
Something I really like about some pub/sub systems is Push subscriptions. For example in GCP pub/sub you can have a "subscriber" that is not pulling events off the queue but instead is an http endpoint that events are pushed to.<p>The nice thing about this is that you can use a runtime like cloud run or lambda and allow that runtime to scale based on http requests and also scale to zero.<p>Setting up autoscaling for workers can be a little bit more finicky, e.g. in kubernetes you might set up KEDA autoscaling based on some queue depth metrics, but these might need to be exported from rabbit.<p>I suppose you could have a setup where your daemon worker is making http requests and in that sense "push" to the place where jobs are actually running, but this adds another level of complexity.<p>Is there any plan to support a push model where you can push jobs over http to daemons that are holding the http connections open?
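The interim setup described in the second-to-last paragraph is fairly small to sketch. Purely illustrative (the endpoint URL and the job source are placeholders, not anything Hatchet provides): a daemon drains the queue and POSTs each job to an autoscaled HTTP runtime like Cloud Run, which then scales on request volume.

    # Hypothetical push adapter: pull jobs from a queue and POST them to an
    # autoscaled HTTP runtime. Endpoint and job source are stand-ins.
    import requests
    from queue import Queue, Empty

    jobs: Queue = Queue()  # stand-in for the real queue (e.g. the Postgres table above)
    ENDPOINT = "https://my-worker.example.com/run"  # hypothetical Cloud Run / Lambda URL

    def push_loop():
        while True:
            try:
                job = jobs.get(timeout=1)
            except Empty:
                continue  # nothing to push right now
            resp = requests.post(ENDPOINT, json=job, timeout=60)
            resp.raise_for_status()  # surface failures so the job can be retried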
Just pointing out even though this is a "Show HN" they are, indeed, backed by YC.<p>Is this going to follow the "open core" pattern or will there be a different path to revenue?
How does this compare against Temporal/Cadence/Conductor? Does hatchet also support durable execution?<p><a href="https://temporal.io/" rel="nofollow">https://temporal.io/</a>
<a href="https://cadenceworkflow.io/" rel="nofollow">https://cadenceworkflow.io/</a>
<a href="https://conductor-oss.org/" rel="nofollow">https://conductor-oss.org/</a>
I need task queues where the client (web browser) can listen to the progress of the task through completion.<p>I love the simplicity & approachability of Deno queues for example, but I’d need to roll my own way to subscribe to task status from the client.<p>Wondering if perhaps the Postgres underpinnings here would make that possible.<p>EDIT: seems so!
<a href="https://docs.hatchet.run/home/features/streaming">https://docs.hatchet.run/home/features/streaming</a>
Ah nice! I am writing a job queue this weekend for a DAG based task runner, so timing is great. I will have a look. I don't need anything too big, but I have written some stuff for using PostgreSQL (FOR UPDATE SKIP LOCKED for the win), sqlite, and in-memory, depending on what I want to use it for.<p>I want the task graph to run without thinking about retries, timeouts, serialized resources, etc.<p>Interested to look at your particular approach.
Looks pretty great! My biggest issue with Celery has been that the observability is pretty bad. Even if you use Celery Flower, it still just doesn’t give me enough insight when I’m trying to debug some problem in production.<p>I’m all for just using Postgres in service of the grug brain philosophy.<p>Will definitely be looking into this, congrats on the launch!
Looks great! Do you publish pricing for your cloud offering?
For the self hosted option, are there plans to create a Kubernetes operator? With an MIT license, do you fear Amazon could create an Amazon Hatchet Service sometime in the future?
We're building a webhook service on FastAPI + Celery + Redis + Grafana + Loki and the experience of setting up every service incrementally was miserable, and even then it feels like logs are being dropped and we run into reliability issues. Felt like something like this should exist already, but I couldn't find anything at the time. Really excited to see where this takes us!
How does this compare to River Queue (<a href="https://riverqueue.com/" rel="nofollow">https://riverqueue.com/</a>)? Besides the additional Python and TS client libraries.
One repeat issue I've had at a past position was the need to schedule an unlimited number of jobs, often months to a year from now. Example use case: a patient schedules an appointment for a follow up in 6 months, so I schedule a series of appointment reminders in the days leading up to it. I might have millions of these jobs.<p>I started out by just entering a record into a database queue and just polling every few seconds. Functional, but our IO costs for polling weren't ideal, and we wanted to distribute this without using stuff like ShedLock. I switched to Redis but it got complicated dealing with multiple dispatchers, OOM issues, and having to run a secondary job to move individual tasks in and out of the immediate queue, etc. I had started looking at switching to backing it with PG and SKIP LOCKED, etc., but I've changed positions.<p>I can see a similar use case on my horizon and wondered if Hatchet would be suitable for it.
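For context, a sketch of what that PG + SKIP LOCKED version might have looked like (an invented table, not Hatchet's implementation): store a run_at timestamp per reminder and have the poller claim only rows that are due, so millions of far-future jobs just sit in the table until their time comes.

    # Illustrative only: claim a batch of due reminders. Table and columns are
    # invented; the point is that future jobs cost nothing until run_at passes.
    import psycopg2

    conn = psycopg2.connect("dbname=app")

    def claim_due_reminders(batch_size=100):
        with conn, conn.cursor() as cur:
            cur.execute("""
                UPDATE reminders
                   SET status = 'running'
                 WHERE id IN (
                       SELECT id
                         FROM reminders
                        WHERE status = 'scheduled'
                          AND run_at <= now()
                        ORDER BY run_at
                          FOR UPDATE SKIP LOCKED
                        LIMIT %s)
             RETURNING id, payload
            """, (batch_size,))
            return cur.fetchall()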
Related, I also wrote my own distributed task queue in Python [0] and TypeScript [1] with a Show HN [2]. Time it took was about a week. I like your features, but it was easy to write my own so I'm curious how you're building a money making business around an open source product. Maybe the fact everyone writes their own means there's no best solution now, so you're trying to be that and do paid closed source features for revenue?<p>[0] <a href="https://github.com/wakatime/wakaq">https://github.com/wakatime/wakaq</a><p>[1] <a href="https://github.com/wakatime/wakaq-ts">https://github.com/wakatime/wakaq-ts</a><p>[2] <a href="https://news.ycombinator.com/item?id=32730038">https://news.ycombinator.com/item?id=32730038</a>
What specific strategies does Hatchet employ to guarantee fault tolerance and enable durable execution? How does it handle partial failures in multi-step workflows?
Latency is really important, and that is honestly why we re-wrote most of this stack ourselves, but this project, with its promise of ~25ms starts, looks interesting. I wish there was an "instant" mode where, if enough workers are available, it could just do direct placement.
How is this different from pg-boss[1]? Other than the distributed part it also seems to use skip locked.<p>[1] <a href="https://github.com/timgit/pg-boss">https://github.com/timgit/pg-boss</a>
Can you explain why you chose to have every function take in context? <a href="https://github.com/hatchet-dev/hatchet/blob/main/python-sdk/examples/dag/worker.py">https://github.com/hatchet-dev/hatchet/blob/main/python-sdk/...</a><p>This seems like a lot of boilerplate to write functions with, to me (context: I created <a href="http://github.com/DAGWorks-Inc/hamilton">http://github.com/DAGWorks-Inc/hamilton</a>).
Wow, looks great! We currently happily use graphile-worker, and have two questions:<p>> full transactional enqueueing<p>Do you mean transactional within the same transaction as the application's own state?<p>My guess is no (from looking at the docs, where enqueuing in the SDK looks a lot like a wire call and not issuing a SQL command over our application's existing connection pool), and that you mean transactionality between steps within the Hatchet jobs...<p>I get that, but fwiw transactionality of "perform business logic against entities + job enqueue" (both for queuing the job itself, as well as work performed by workers) is the primary reason we're using a PG-based job queue, as then we avoid transactional outboxes for each queue/work step.<p>So, dunno, losing that would be a big deal/kinda defeat the purpose (for us) of a PG-based queue.<p>2nd question, not to be a downer, but I'm just genuinely curious as a wanna-be dev infra/tooling engineer: a) why take funding to build this (it seems bootstrappable? maybe that's naive), and b) why would YC keep putting money into these "look really neat but ...surely?... will never be the 100x returns/billion dollar companies" dev infra startups? Or maybe I'm over-estimating the size of the return/exit necessary to make it worth their while.
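To make the first question concrete, this is what "same transaction as the application's own state" looks like with graphile-worker's add_job SQL helper, which the parent already runs (the "orders" table is invented for the example):

    # The pattern the parent is describing: the business write and the job enqueue
    # commit (or roll back) together. graphile_worker.add_job is graphile-worker's
    # SQL helper; the "orders" table is made up.
    import psycopg2

    conn = psycopg2.connect("dbname=app")

    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (customer_id, total) VALUES (%s, %s) RETURNING id",
            (42, 99.50),
        )
        order_id = cur.fetchone()[0]
        cur.execute(
            "SELECT graphile_worker.add_job('send_receipt', json_build_object('orderId', %s))",
            (order_id,),
        )
    # if anything above raises, neither the order nor the job exists - no outbox table needed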
A related lively discussion from a few months ago: <a href="https://news.ycombinator.com/item?id=37636841">https://news.ycombinator.com/item?id=37636841</a><p>Long live Postgres queues.
I've been looking for this exact thing for a while now. I'm just starting to dig into the docs and examples, and I have a question on workflows.<p>I have an existing pipeline that runs tasks across two K8s clusters that share a DB. Is it possible to define steps in a workflow where the step run logic is set up to run elsewhere? Essentially not having an inline run function defined, and instead another worker process listening for that step name.
The website for Hatchet and the GitHub repository make it look like a compelling distributed task queue solution. I see from the main website that this appears to have commercial aspirations, but I don’t see any pricing information available. Do you have a pricing model yet? I’d be apprehensive to consider using Hatchet in future projects without knowing how much it costs.
It’s been about a dozen years since I heard someone assert that some CI/CD services were the most reliable task scheduling software for periodic tasks (far better than cron). Shouldn’t the scheduling be factored out as a separate library?<p>I found that shocking at the time, if plausible, and wondered why nobody pulled on that thread. I suppose like me they had bigger fish to fry.
I'm curious if this supports coroutines as tasks in Python. It's especially useful for genAI, and legacy queues (namely Celery) are lacking in this regard.<p>It would help to see a mapping of Celery to Hatchet as examples. The current examples require you to understand (and buy into) Hatchet's model, but that's hard to do without understanding how it compares to existing solutions.
Ola, fellow YC founders. Surely you have seen Windmill, since you refer to it in the comments below. It looks like Hatchet, being a lot more recent, currently has a subset of what Windmill offers, albeit with a focus solely on the task queue and without the self-hosted enterprise focus. So it looks more like a competitor to Inngest than to Windmill. We released workflows as code last week, which was the primary differentiator between us and other workflow engines so far: <a href="https://www.windmill.dev/docs/core_concepts/workflows_as_code">https://www.windmill.dev/docs/core_concepts/workflows_as_cod...</a><p>The license is more permissive than ours (MIT vs AGPLv3), and you're using Go vs Rust for us, but other than that the architecture looks extremely similar, also based mostly on Postgres with the same insight as us: it's sufficient. I'm curious where you see the main differentiator long-term.
Congrats on the launch!<p>You say Celery can use Redis or RabbitMQ as a broker, but I've also used it with Postgres as a broker successfully, although on a smaller scale (just a single DB node). It's undocumented, so I definitely wouldn't recommend anybody use this in production now, but it seems to still work fine. [1]<p>How does Hatchet compare to this setup? Also, have you considered making a plugin backend for Celery, so that old systems can be ported more easily?<p>[1]: <a href="https://stackoverflow.com/a/47604045/1593459" rel="nofollow">https://stackoverflow.com/a/47604045/1593459</a>
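For anyone curious, the setup in the linked answer boils down to pointing Celery's SQLAlchemy (kombu) transport at Postgres, roughly like this; unsupported and undocumented, as the parent notes, so treat it as a sketch rather than a recommendation:

    # Celery using Postgres as both broker and result backend via the SQLAlchemy
    # transport. Works, but is undocumented and not recommended for production.
    from celery import Celery

    app = Celery(
        "tasks",
        broker="sqla+postgresql://user:pass@localhost/celery_broker",
        backend="db+postgresql://user:pass@localhost/celery_results",
    )

    @app.task
    def add(x, y):
        return x + y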
I’m interested in self hosting this. What’s the recommendation here for state persistence and self healing? Wish there was a guide for a small team who wants to self host before trying managed cloud
You've explained your value proposition vs. celery, but I'm curious if you also see Hatchet as an alternative to Nextflow/Snakemake which are commonly used in bioinformatics.
I love this idea. I wish it existed a few years ago when I did a not so good job of implementing a distributed DAG processing system :D<p>Looking forward to trying it out!
In <a href="https://docs.hatchet.run/home/quickstart/installation">https://docs.hatchet.run/home/quickstart/installation</a>, it says<p>> Welcome to Hatchet! This guide walks you through getting set up on Hatchet Cloud. If you'd like to self-host Hatchet, please see the self-hosted quickstart instead.<p>but the link to "self-hosted quickstart" links back to the same page
Looks very promising. Recently, I built an asynchronous DAG executor in Python, and I always felt I was reinventing the wheel, but when looking for a resilient and distributed DAG executor, nothing was really meeting the requirements. The feature set is appealing. Wondering if adding/removing/skipping nodes to the DAG dynamically at runtime is possible.
Been following since Hatchet was an OSS TFC alternative. Seems like you guys pivoted. Curious to learn why and how you moved from the earlier value prop to this one?
Since these are task executions in a DAG, to what degree does it compete with dagster or airflow? I get that I can’t define the task with Hatchet, but if I already want to separate my DAG from my tasks, is this a viable option?
I wish that this was just an SDK built on top of a provider/standard.
AMQP 1.0 is a standard protocol.
You can build all this without being tied to a product or to RabbitMQ, with a storage provider and an AMQP protocol layer.
You say this is for generative AI. How do you distribute inference across workers? Can one use just any protocol and how does this work together with the queue and fault tolerance?<p>Could not find any specifics on generative AI in your docs. Thanks
From your experience, what would be a good way to do Postgres master-master replication? My understanding is that Postgres Professional/EnterpriseDB-based solutions provide reliable M-M, but those are proprietary.
> Hatchet is built on a low-latency queue (25ms average start)<p>That seems pretty long - am I misunderstanding something? By my understanding this means the time from enqueue to job processing, maybe someone can enlighten me.
Have you considered <a href="https://github.com/tembo-io/pgmq">https://github.com/tembo-io/pgmq</a> for the queue bit?
Hey @abelanger,<p>I got a few feature requests for Pueue that were out of scope as they didn't fit Pueue's vision, but seem to fit Hatchet quite well (e.g. complex scheduling functionality and multi-agent support) :)<p>One thing I'm missing from your website, however, is an actual view of the interface - what does the actual user interface look like?<p>Having the possibility to schedule stuff in a smart way is nice and all, but how do you *oversee* it? It's important to get a good overview of how your tasks perform.<p>Once I'm convinced that this is actually a useful piece of software, I would like to reference you in the Readme of Pueue as an alternative for users that need more powerful scheduling features (or multi-client support) :) Would that be ok with you?
This is one of my favourite spaces, and the presentation in the readme is clear: it immediately told me what it is and gave me most of the key information that I usually complain is missing.<p>However, I am still missing a section on why this is different from any of the other existing and more mature solutions. What led you to develop this over existing options, and what different tradeoffs did you make? Extra points if you can concisely tell me what you do badly that your 'competitors' do well, because I don't believe there is one best solution in this space; it is all tradeoffs.
Exciting time for distributed, transactional task queue projects built on top of PostgreSQL!<p>Here are the most heavily upvoted in the past 12 months:<p>Hatchet <a href="https://news.ycombinator.com/item?id=39643136">https://news.ycombinator.com/item?id=39643136</a><p>Inngest <a href="https://news.ycombinator.com/item?id=36403014">https://news.ycombinator.com/item?id=36403014</a><p>Windmill <a href="https://news.ycombinator.com/item?id=35920082">https://news.ycombinator.com/item?id=35920082</a><p>HN comments on Temporal.io
<a href="https://github.com/temporalio">https://github.com/temporalio</a>
<a href="https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=Temporal.io&sort=byDate&type=comment" rel="nofollow">https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...</a><p>Internally we rant about the complexity of the above projects vs using transactional job queues libs like:<p>river <a href="https://news.ycombinator.com/item?id=38349716">https://news.ycombinator.com/item?id=38349716</a><p>neoq: [<a href="https://github.com/acaloiaro/neoq](https://github.com/acaloiaro/neoq)">https://github.com/acaloiaro/neoq](https://github.com/acaloi...</a><p>gue: [<a href="https://github.com/vgarvardt/gue](https://github.com/vgarvardt/gue)">https://github.com/vgarvardt/gue](https://github.com/vgarvar...</a><p>Deep inside can't wait to see some like ThePrimeTimeagen to review it ;)
<a href="https://www.youtube.com/@ThePrimeTimeagen" rel="nofollow">https://www.youtube.com/@ThePrimeTimeagen</a>