Hello HN, we're Gabe and Alexander from Hatchet (<a href="https://hatchet.run">https://hatchet.run</a>). We're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.<p>Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, where data loss can occur under OOM conditions if you're not careful; using PG helps avoid that entire class of problems.<p>We also wanted something that was significantly easier to use and debug for application developers. A lot of the time, the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks based on exported prom metrics). We're building this type of observability directly into Hatchet.<p>What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters and regions - they are remotely invoked via a long-lived gRPC connection with the Hatchet queue. We've attempted to optimize our latency to get task start times down to 25-50ms, and much more optimization is on the roadmap.<p>We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, and dependent tasks. A few things we're currently working on: we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but we didn't want to spend additional time on the exchange logic until we'd built a stable underlying queue. We are also considering the use of NATS for engine-engine and engine-worker connections.<p>We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.
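For anyone who hasn't seen the SKIP LOCKED pattern in practice, here's a minimal sketch of how a Postgres-backed dequeue typically works. This is just the general pattern, not Hatchet's actual schema; the "tasks" table and its columns are invented for illustration.

    # Minimal sketch of the FOR UPDATE SKIP LOCKED dequeue pattern.
    # Not Hatchet's schema -- table and columns are invented for the example.
    import psycopg2

    conn = psycopg2.connect("dbname=app")

    def dequeue_one():
        with conn:  # one transaction per claim; commits (or rolls back) on exit
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT id, payload
                      FROM tasks
                     WHERE status = 'queued'
                     ORDER BY created_at
                       FOR UPDATE SKIP LOCKED
                     LIMIT 1
                """)
                row = cur.fetchone()
                if row is None:
                    return None  # nothing available that another worker hasn't already locked
                cur.execute("UPDATE tasks SET status = 'running' WHERE id = %s", (row[0],))
                return row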
I love your vision and am excited to see the execution! I've been looking for <i>exactly</i> this product (postgres-backed task queue with workers in multiple languages and decent built-in observability) for like... 3 years. Every 6 months I'll check in and see if someone has built it yet, evaluate the alternatives, and come away disappointed.<p>One important feature request that probably would block our adoption: one reason why I prefer a postgres-backed queue over e.g. Redis is just to simplify our infra by having fewer servers and technologies in the stack. Adding in RabbitMQ is definitely an extra dependency I'd really like to avoid.<p>(Currently we've settled on graphile-worker which is fine for what it does, but leaves a lot of boxes unchecked.)
Something I really like about some pub/sub systems is Push subscriptions. For example in GCP pub/sub you can have a "subscriber" that is not pulling events off the queue but instead is an http endpoint that events are pushed to.<p>The nice thing about this is that you can use a runtime like cloud run or lambda and allow that runtime to scale based on http requests and also scale to zero.<p>Setting up autoscaling for workers can be a little bit more finicky, e.g. in kubernetes you might set up KEDA autoscaling based on some queue depth metrics, but these might need to be exported from rabbit.<p>I suppose you could have a setup where your daemon worker is making http requests and in that sense "push" to the place where jobs are actually running, but this adds another level of complexity.<p>Is there any plan to support a push model where you can push jobs over http to daemons that are holding the http connections open?
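The interim setup described in the second-to-last paragraph is fairly small to sketch. Purely illustrative (the endpoint URL and the job source are placeholders, not anything Hatchet provides): a daemon drains the queue and POSTs each job to an autoscaled HTTP runtime like Cloud Run, which then scales on request volume.

    # Hypothetical push adapter: pull jobs from a queue and POST them to an
    # autoscaled HTTP runtime. Endpoint and job source are stand-ins.
    import requests
    from queue import Queue, Empty

    jobs: Queue = Queue()  # stand-in for the real queue (e.g. the Postgres table above)
    ENDPOINT = "https://my-worker.example.com/run"  # hypothetical Cloud Run / Lambda URL

    def push_loop():
        while True:
            try:
                job = jobs.get(timeout=1)
            except Empty:
                continue  # nothing to push right now
            resp = requests.post(ENDPOINT, json=job, timeout=60)
            resp.raise_for_status()  # surface failures so the job can be retried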
Just pointing out even though this is a "Show HN" they are, indeed, backed by YC.<p>Is this going to follow the "open core" pattern or will there be a different path to revenue?
How does this compare against Temporal/Cadence/Conductor? Does hatchet also support durable execution?<p><a href="https://temporal.io/" rel="nofollow">https://temporal.io/</a>
<a href="https://cadenceworkflow.io/" rel="nofollow">https://cadenceworkflow.io/</a>
<a href="https://conductor-oss.org/" rel="nofollow">https://conductor-oss.org/</a>
I need task queues where the client (web browser) can listen to the progress of the task through completion.<p>I love the simplicity & approachability of Deno queues for example, but I’d need to roll my own way to subscribe to task status from the client.<p>Wondering if perhaps the Postgres underpinnings here would make that possible.<p>EDIT: seems so!
<a href="https://docs.hatchet.run/home/features/streaming">https://docs.hatchet.run/home/features/streaming</a>
Ah nice! I am writing a job queue this weekend for a DAG based task runner, so timing is great. I will have a look. I don't need anything too big, but I have written some stuff for using PostgreSQL (FOR UPDATE SKIP LOCKED for the win), sqlite, and in-memory, depending on what I want to use it for.<p>I want the task graph to run without thinking about retries, timeouts, serialized resources, etc.<p>Interested to look at your particular approach.
Looks pretty great! My biggest issue with Celery has been that the observability is pretty bad. Even if you use Celery Flower, it still just doesn’t give me enough insight when I’m trying to debug some problem in production.<p>I’m all for just using Postgres in service of the grug brain philosophy.<p>Will definitely be looking into this, congrats on the launch!
Looks great! Do you publish pricing for your cloud offering?
For the self hosted option, are there plans to create a Kubernetes operator? With an MIT license, do you fear Amazon could create an Amazon Hatchet Service sometime in the future?
We're building a webhook service on FastAPI + Celery + Redis + Grafana + Loki and the experience of setting up every service incrementally was miserable, and even then it feels like logs are being dropped and we run into reliability issues. Felt like something like this should exist already, but I couldn't find anything at the time. Really excited to see where this takes us!
How does this compare to River Queue (<a href="https://riverqueue.com/" rel="nofollow">https://riverqueue.com/</a>)? Besides the additional Python and TS client libraries.
One repeat issue I've had at a past position was the need to schedule an unlimited number of jobs, often months to a year from now. Example use case: a patient schedules an appointment for a follow up in 6 months, so I schedule a series of appointment reminders in the days leading up to it. I might have millions of these jobs.<p>I started out by just entering a record into a database queue and just polling every few seconds. Functional, but our IO costs for polling weren't ideal, and we wanted to distribute this without using stuff like ShedLock. I switched to Redis but it got complicated dealing with multiple dispatchers, OOM issues, and having to run a secondary job to move individual tasks in and out of the immediate queue, etc. I had started looking at switching to backing it with PG and SKIP LOCKED, etc., but I've changed positions.<p>I can see a similar use case on my horizon and wondered if Hatchet would be suitable for it.
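For context, a sketch of what that PG + SKIP LOCKED version might have looked like (an invented table, not Hatchet's implementation): store a run_at timestamp per reminder and have the poller claim only rows that are due, so millions of far-future jobs just sit in the table until their time comes.

    # Illustrative only: claim a batch of due reminders. Table and columns are
    # invented; the point is that future jobs cost nothing until run_at passes.
    import psycopg2

    conn = psycopg2.connect("dbname=app")

    def claim_due_reminders(batch_size=100):
        with conn, conn.cursor() as cur:
            cur.execute("""
                UPDATE reminders
                   SET status = 'running'
                 WHERE id IN (
                       SELECT id
                         FROM reminders
                        WHERE status = 'scheduled'
                          AND run_at <= now()
                        ORDER BY run_at
                          FOR UPDATE SKIP LOCKED
                        LIMIT %s)
             RETURNING id, payload
            """, (batch_size,))
            return cur.fetchall()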
Related, I also wrote my own distributed task queue in Python [0] and TypeScript [1] with a Show HN [2]. Time it took was about a week. I like your features, but it was easy to write my own so I'm curious how you're building a money making business around an open source product. Maybe the fact everyone writes their own means there's no best solution now, so you're trying to be that and do paid closed source features for revenue?<p>[0] <a href="https://github.com/wakatime/wakaq">https://github.com/wakatime/wakaq</a><p>[1] <a href="https://github.com/wakatime/wakaq-ts">https://github.com/wakatime/wakaq-ts</a><p>[2] <a href="https://news.ycombinator.com/item?id=32730038">https://news.ycombinator.com/item?id=32730038</a>
What specific strategies does Hatchet employ to guarantee fault tolerance and enable durable execution? How does it handle partial failures in multi-step workflows?
Latency is really important, and that is honestly why we re-wrote most of this stack ourselves, but this project, with its promise of ~25ms starts, looks interesting. I wish there was an "instant" mode where, if enough workers are available, it could just do direct placement.
How is this different from pg-boss[1]? Other than the distributed part it also seems to use skip locked.<p>[1] <a href="https://github.com/timgit/pg-boss">https://github.com/timgit/pg-boss</a>
Can you explain why you chose to have every function take in context? <a href="https://github.com/hatchet-dev/hatchet/blob/main/python-sdk/examples/dag/worker.py">https://github.com/hatchet-dev/hatchet/blob/main/python-sdk/...</a><p>This seems like a lot of boilerplate to write functions with, to me (context: I created <a href="http://github.com/DAGWorks-Inc/hamilton">http://github.com/DAGWorks-Inc/hamilton</a>).
Wow, looks great! We currently happily use graphile-worker, and have two questions:<p>> full transactional enqueueing<p>Do you mean transactional within the same transaction as the application's own state?<p>My guess is no (from looking at the docs, where enqueuing in the SDK looks a lot like a wire call and not issuing a SQL command over our application's existing connection pool), and that you mean transactionality between steps within the Hatchet jobs...<p>I get that, but fwiw transactionality of "perform business logic against entities + job enqueue" (both for queuing the job itself, as well as work performed by workers) is the primary reason we're using a PG-based job queue, as then we avoid transactional outboxes for each queue/work step.<p>So, dunno, losing that would be a big deal/kinda defeat the purpose (for us) of a PG-based queue.<p>2nd question, not to be a downer, but I'm just genuinely curious as a wanna-be dev infra/tooling engineer: a) why take funding to build this (it seems bootstrappable? maybe that's naive), and b) why would YC keep putting money into these "look really neat but ...surely?... will never be the 100x returns/billion dollar companies" dev infra startups? Or maybe I'm over-estimating the size of the return/exit necessary to make it worth their while.
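To make the first question concrete, this is what "same transaction as the application's own state" looks like with graphile-worker's add_job SQL helper, which the parent already runs (the "orders" table is invented for the example):

    # The pattern the parent is describing: the business write and the job enqueue
    # commit (or roll back) together. graphile_worker.add_job is graphile-worker's
    # SQL helper; the "orders" table is made up.
    import psycopg2

    conn = psycopg2.connect("dbname=app")

    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (customer_id, total) VALUES (%s, %s) RETURNING id",
            (42, 99.50),
        )
        order_id = cur.fetchone()[0]
        cur.execute(
            "SELECT graphile_worker.add_job('send_receipt', json_build_object('orderId', %s))",
            (order_id,),
        )
    # if anything above raises, neither the order nor the job exists - no outbox table needed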
A related lively discussion from a few months ago: <a href="https://news.ycombinator.com/item?id=37636841">https://news.ycombinator.com/item?id=37636841</a><p>Long live Postgres queues.
I've been looking for this exact thing for a while now. I'm just starting to dig into the docs and examples, and I have a question on workflows.<p>I have an existing pipeline that runs tasks across two K8s clusters that share a DB. Is it possible to define steps in a workflow where the step run logic is set up to run elsewhere? Essentially not having an inline run function defined, and instead another worker process listening for that step name.
The website for Hatchet and the GitHub repository make it look like a compelling distributed task queue solution. I see from the main website that this appears to have commercial aspirations, but I don’t see any pricing information available. Do you have a pricing model yet? I’d be apprehensive to consider using Hatchet in future projects without knowing how much it costs.
It’s been about a dozen years since I heard someone assert that some CI/CD services were the most reliable task scheduling software for periodic tasks (far better than cron). Shouldn’t the scheduling be factored out as a separate library?<p>I found that shocking at the time, if plausible, and wondered why nobody pulled on that thread. I suppose like me they had bigger fish to fry.
I'm curious if this supports coroutines as tasks in Python. It's especially useful for genAI, and legacy queues (namely Celery) are lacking in this regard.<p>It would help to see a mapping of Celery to Hatchet as examples. The current examples require you to understand (and buy into) Hatchet's model, but that's hard to do without understanding how it compares to existing solutions.
Ola, fellow YC founders. Surely you have seen Windmill, since you refer to it in the comments below. It looks like Hatchet, being a lot more recent, currently has a subset of what Windmill offers, albeit with a focus solely on the task queue and without the self-hosted enterprise focus. So it looks more like a competitor to Inngest than to Windmill. We released workflows as code last week, which was the primary differentiator between us and other workflow engines so far: <a href="https://www.windmill.dev/docs/core_concepts/workflows_as_code">https://www.windmill.dev/docs/core_concepts/workflows_as_cod...</a><p>The license is more permissive than ours (MIT vs AGPLv3), and you're using Go vs Rust for us, but other than that the architecture looks extremely similar, also based mostly on Postgres with the same insight as us: it's sufficient. I'm curious where you see the main differentiator long-term.
Congrats on the launch!<p>You say Celery can use Redis or RabbitMQ as a broker, but I've also used it with Postgres as a broker successfully, although on a smaller scale (just a single DB node). It's undocumented, so I definitely wouldn't recommend anybody use this in production now, but it seems to still work fine. [1]<p>How does Hatchet compare to this setup? Also, have you considered making a plugin backend for Celery, so that old systems can be ported more easily?<p>[1]: <a href="https://stackoverflow.com/a/47604045/1593459" rel="nofollow">https://stackoverflow.com/a/47604045/1593459</a>
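For anyone curious, the setup in the linked answer boils down to pointing Celery's SQLAlchemy (kombu) transport at Postgres, roughly like this; unsupported and undocumented, as the parent notes, so treat it as a sketch rather than a recommendation:

    # Celery using Postgres as both broker and result backend via the SQLAlchemy
    # transport. Works, but is undocumented and not recommended for production.
    from celery import Celery

    app = Celery(
        "tasks",
        broker="sqla+postgresql://user:pass@localhost/celery_broker",
        backend="db+postgresql://user:pass@localhost/celery_results",
    )

    @app.task
    def add(x, y):
        return x + y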
I’m interested in self hosting this. What’s the recommendation here for state persistence and self healing? Wish there was a guide for a small team who wants to self host before trying managed cloud
You've explained your value proposition vs. celery, but I'm curious if you also see Hatchet as an alternative to Nextflow/Snakemake which are commonly used in bioinformatics.
I love this idea. I wish it existed a few years ago when I did a not so good job of implementing a distributed DAG processing system :D<p>Looking forward to trying it out!
In <a href="https://docs.hatchet.run/home/quickstart/installation">https://docs.hatchet.run/home/quickstart/installation</a>, it says<p>> Welcome to Hatchet! This guide walks you through getting set up on Hatchet Cloud. If you'd like to self-host Hatchet, please see the self-hosted quickstart instead.<p>but the link to "self-hosted quickstart" links back to the same page
Looks very promising. Recently, I built an asynchronous DAG executor in Python, and I always felt I was reinventing the wheel, but when looking for a resilient and distributed DAG executor, nothing was really meeting the requirements. The feature set is appealing. Wondering if adding/removing/skipping nodes to the DAG dynamically at runtime is possible.
Been following since Hatchet was an OSS TFC alternative. Seems like you guys pivoted. Curious to learn why and how you moved from the earlier value prop to this one?
Since these are task executions in a DAG, to what degree does it compete with dagster or airflow? I get that I can’t define the task with Hatchet, but if I already want to separate my DAG from my tasks, is this a viable option?
I wish that this was just an SDK built on top of a provider/standard.
AMQP 1.0 is a standard protocol.
You can build all this without being tied to a product or to RabbitMQ, with a storage provider and an AMQP protocol layer.
You say this is for generative AI. How do you distribute inference across workers? Can one use just any protocol and how does this work together with the queue and fault tolerance?<p>Could not find any specifics on generative AI in your docs. Thanks
From your experience, what would be a good way to do Postgres master-master replication? My understanding is that Postgres Professional/EnterpriseDB-based solutions provide reliable M-M, but those are proprietary.
> Hatchet is built on a low-latency queue (25ms average start)<p>That seems pretty long - am I misunderstanding something? By my understanding this means the time from enqueue to job processing, maybe someone can enlighten me.
Have you considered <a href="https://github.com/tembo-io/pgmq">https://github.com/tembo-io/pgmq</a> for the queue bit?
Hey @abelanger,<p>I got a few feature requests for Pueue that were out of scope as they didn't fit Pueue's vision, but seem to fit Hatchet quite well (e.g. complex scheduling functionality and multi-agent support) :)<p>One thing I'm missing from your website, however, is an actual view of the interface - what does the actual user interface look like?<p>Having the possibility to schedule stuff in a smart way is nice and all, but how do you *oversee* it? It's important to get a good overview of how your tasks perform.<p>Once I'm convinced that this is actually a useful piece of software, I would like to reference you in the Readme of Pueue as an alternative for users that need more powerful scheduling features (or multi-client support) :) Would that be ok with you?
This is one of my favourite spaces, and the presentation in the readme is clear: it immediately told me what it is and gave me most of the key information that I usually complain is missing.<p>However, I am still missing a section on why this is different from any of the other existing and more mature solutions. What led you to develop this over existing options, and what different tradeoffs did you make? Extra points if you can concisely tell me what you do badly that your 'competitors' do well, because I don't believe there is one best solution in this space; it is all tradeoffs.
Exciting time for distributed, transactional task queue projects built on top of PostgreSQL!<p>Here are the most heavily upvoted in the past 12 months:<p>Hatchet <a href="https://news.ycombinator.com/item?id=39643136">https://news.ycombinator.com/item?id=39643136</a><p>Inngest <a href="https://news.ycombinator.com/item?id=36403014">https://news.ycombinator.com/item?id=36403014</a><p>Windmill <a href="https://news.ycombinator.com/item?id=35920082">https://news.ycombinator.com/item?id=35920082</a><p>HN comments on Temporal.io
<a href="https://github.com/temporalio">https://github.com/temporalio</a>
<a href="https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=Temporal.io&sort=byDate&type=comment" rel="nofollow">https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...</a><p>Internally we rant about the complexity of the above projects vs using transactional job queues libs like:<p>river <a href="https://news.ycombinator.com/item?id=38349716">https://news.ycombinator.com/item?id=38349716</a><p>neoq: [<a href="https://github.com/acaloiaro/neoq](https://github.com/acaloiaro/neoq)">https://github.com/acaloiaro/neoq](https://github.com/acaloi...</a><p>gue: [<a href="https://github.com/vgarvardt/gue](https://github.com/vgarvardt/gue)">https://github.com/vgarvardt/gue](https://github.com/vgarvar...</a><p>Deep inside can't wait to see some like ThePrimeTimeagen to review it ;)
<a href="https://www.youtube.com/@ThePrimeTimeagen" rel="nofollow">https://www.youtube.com/@ThePrimeTimeagen</a>