Show HN: Managed GitHub Actions Runners for AWS

117 points by jacobwg about 1 year ago
Hey HN! I'm Jacob, one of the founders of Depot (https://depot.dev), a build service for Docker images, and I'm excited to show what we've been working on for the past few months: run GitHub Actions jobs in AWS, orchestrated by Depot!

Here's a video demo: https://www.youtube.com/watch?v=VX5Z-k1mGc8, and here's our blog post: https://depot.dev/blog/depot-github-actions-runners.

While GitHub Actions is one of the most prevalent CI providers, Actions is slow, for a few reasons: GitHub uses underpowered CPUs, network throughput for cache and the internet at large is capped at 1 Gbps, and total cache storage is limited to 10 GB per repo. It is also rather expensive for runners with more than 2 CPUs, and larger runners frequently take a long time to start running jobs.

Depot-managed runners solve this! Rather than your CI jobs running on GitHub's slow compute, Depot routes those same jobs to fast EC2 instances. And not only is this faster, it's also 1/2 the cost of GitHub Actions!

We do this by launching a dedicated instance for each job, registering that instance as a self-hosted Actions runner in your GitHub organization, then terminating the instance when the job is finished. Using AWS as the compute provider has a few advantages:

- CPUs are typically 30%+ more performant than alternatives (the m7a instance type).

- Each instance has high-throughput networking of up to 12.5 Gbps, hosted in us-east-1, so interacting with artifacts, cache, container registries, or the internet at large is quick.

- Each instance has a public IPv4 address, so it does not share rate limits with anyone else.

We integrated the runners with the distributed cache system (backed by S3 and Ceph) that we use for Docker build cache, so jobs automatically save / restore cache from this system, with speeds of up to 1 GB/s, and without the default 10 GB per repo limit.

Building this was a fun challenge; some matrix workflows start 40+ jobs at once, requiring 40+ EC2 instances to launch simultaneously.

We've gotten very good at starting EC2 instances quickly with a "warm pool" system: we prepare many EC2 instances to run a job, stop them, then resize and start them when an actual job request arrives, keeping job queue times around 5 seconds. We're using a homegrown orchestration system, as alternatives like autoscaling groups or Kubernetes weren't fast or secure enough.

There are three alternatives to our managed runners currently:

1. GitHub offers larger runners: these have more CPUs, but still have slow network and cache. Depot runners are also 1/2 the cost per minute of GitHub's runners.

2. You can self-host the Actions runner on your own compute: this requires ongoing maintenance, and it can be difficult to ensure that the runner image or container matches GitHub's.

3. There are other companies offering hosted GitHub Actions runners, though they frequently use cheaper compute hosting providers that are bottlenecked on network throughput or geography.

Any feedback is very welcome!
You can sign up at https://depot.dev/sign-up for a free trial if you'd like to try it out on your own workflows. We aren't able to offer a trial without a signup gate, both because using it requires installing a GitHub app and because we're offering build compute, so we need some way to keep out the cryptominers :)
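A minimal sketch of the warm-pool idea described above, assuming boto3 and placeholder AMI / instance sizes (illustrative only, not Depot's actual orchestration code):

```python
# Hypothetical "warm pool" for CI runners: instances are launched and stopped
# ahead of time, then resized and started when a job request arrives.
# Assumes boto3 credentials/region are configured; IDs and sizes are made up.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def prewarm(ami_id: str, count: int) -> list[str]:
    """Launch instances, then stop them so they can be started quickly later."""
    resp = ec2.run_instances(
        ImageId=ami_id, InstanceType="m7a.large",
        MinCount=count, MaxCount=count,
    )
    ids = [i["InstanceId"] for i in resp["Instances"]]
    ec2.get_waiter("instance_running").wait(InstanceIds=ids)
    ec2.stop_instances(InstanceIds=ids)
    return ids

def claim_for_job(instance_id: str, instance_type: str) -> None:
    """On a job request: resize a stopped warm instance and start it."""
    # The instance type can only be changed while the instance is stopped.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": instance_type},
    )
    ec2.start_instances(InstanceIds=[instance_id])
    # The instance would then register itself as a self-hosted Actions runner
    # (e.g. via user data) and be terminated once the job completes.
```

The point of stopping rather than terminating is that a stopped instance keeps its prepared root volume, so starting it is much faster than launching and provisioning a runner from scratch.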

19 comments

jitl about 1 year ago
At Notion we run our GitHub Actions jobs on ECS, and use auto-scaling to add and remove hosts from the ECS cluster as demand fluctuates throughout the day. We also age out and terminate hosts, although they usually live for a few days to a week. I guess we had to pay some one-time setup costs around configuring the ECS cluster and fiddling with runner tags, but it seems to work pretty well. We have our own cache action, although it's not as fancy as Depot's: just a tarball in S3.

Overall it's a pretty simple Terraform setup plus a couple of Dockerfiles. And we get to run in the same region as the rest of our infra, which is close to most of our devs (us-west-2).

ECS might sound more complicated than "just use EC2", but we don't have to screw around with Lambdas, and the Terraform is pretty simple, much simpler than the Philips-labs one. It's about 1400 lines of Terraform across 2 files, since ECS has so much built in and integrates well with auto-scaling groups.
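For illustration, a rough sketch of what a "tarball in S3" cache step might look like (hypothetical bucket name and key scheme; not Notion's actual cache action):

```python
# Save/restore a directory as a compressed tarball in S3, keyed by a cache key.
import subprocess
import boto3
from botocore.exceptions import ClientError

BUCKET = "example-ci-cache"  # assumption: a bucket you own in the same region
s3 = boto3.client("s3")

def save_cache(key: str, path: str) -> None:
    """Tar up `path` (relative to the workspace) and upload it under `key`."""
    tarball = f"/tmp/{key}.tar.zst"
    subprocess.run(["tar", "--zstd", "-cf", tarball, path], check=True)
    s3.upload_file(tarball, BUCKET, f"cache/{key}.tar.zst")

def restore_cache(key: str) -> bool:
    """Download and extract the tarball for `key`; return False on a miss."""
    tarball = f"/tmp/{key}.tar.zst"
    try:
        s3.download_file(BUCKET, f"cache/{key}.tar.zst", tarball)
    except ClientError:
        return False  # cache miss (or access error)
    subprocess.run(["tar", "--zstd", "-xf", tarball], check=True)
    return True
```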
toomuchtodo about 1 year ago
How will you compete if GitHub talks to the Azure folks (who have the benefit of Azure scale) and gets better compute and network treatment for runners? Or is the assumption that GitHub-hosted runners remain perpetually stunted as described (which is potentially a fair and legit assumption to make based on MS silos and enterprise inertia)?

To be clear, this is a genuine question, as compute (even when efficiently orchestrated and arbitraged) is a commodity. Your cache strategy is good (I'll be interested in testing to tease out where S3 is used and where Ceph is), but it's not a moat and is somewhat straightforward to replicate.

(again, questions from a place of curiosity, nothing more)
werewrsdf about 1 year ago
I recently set up AWS GitHub runners with this Terraform module. It works well, and you don't have to pay anything extra on top of AWS: https://github.com/philips-labs/terraform-aws-github-runner
SOLAR_FIELDS about 1 year ago
One of the most interesting value adds for me is not any of the things mentioned by OP. I would like a managed hosted runner solution where I can also have a BuildKit cache in the same data center, where I don't have to pay ingress/egress to that cache but also don't have to manage my own runner infra. I have done the whole self-hosted Karpenter + Actions Runner Controller thing to achieve this, and it is a lot of work to set up and tune to get right.

The problem is really that GitHub's caching offering is very limited for anything except the most basic use cases, and they don't offer a way to colocate your own cache with them so that you aren't paying cloud fees back and forth. You have to use their machines, their storage, and their protocol, which is only really viable if your definition of caching is literally just "upload files here" and "check if the uploaded file already exists".

Yes, I'm aware that BuildKit offers "experimental" GHA caching support. But given how fat image layers are, it's basically unusable for anything beyond a toy project that builds a couple of layers on top of an Alpine image (as of the time of writing, GHA limits cache size to 10 GB per repo; fine if you're building npm or PyPI packages or whatever, but hilariously inadequate for BuildKit layer caching).
cocoflunchy about 1 year ago
Half the price of GitHub is not great right now; this space is heating up! Ubicloud is 10x cheaper, and https://runs-on.com is in the same ballpark by using spot instances. (Currently switching to RunsOn.)
madisp about 1 year ago
> Each instance has high-throughput networking of up to 12.5 Gbps, hosted in us-east-1, so interacting with artifacts, cache, container registries, or the internet at large is quick.

Do you actually get the promised 12.5 Gbps? I've been doing some experiments and it's really hard to get over 2.5 Gbit/s upstream from AWS EC2, even when using large 64 vCPU machines. Intra-AWS (e.g. VPC) traffic is another thing, and that seems to be OK.
watermelon0 about 1 year ago
Hey @jacobwg, this looks great.

I couldn't find it anywhere on the page, but do you support Graviton3 (i.e. m7g instances) for GHA runners? If the answer is no, are there any plans to support it in the future?

> start them when an actual job request arrives, to keep job queue times around 5 seconds

Did you have to fine-tune the Ubuntu kernel/systemd boot to reach such fast startup times?
jillesvangurp about 1 year ago
A few months ago I had some issues with build performance. We're on the free plan with GitHub, so using custom runners is not an issue. But I found a nice workaround:

- Create a virtual machine with everything you need in gcloud (would work for AWS as well). Pick something nice and fast. Suspend it.

- In your GitHub Action, resume the VM, SSH into it to run your build script, and suspend it afterwards.

Super easy to implement and easy to script using gcloud commands. It adds about 30 seconds to the build for starting the VM. On the machine, we simply pull from git and check out the relevant branch. It doesn't work for concurrent builds, but it's a nice low-tech solution. And you only pay for the time the machine is up and running, which is a few minutes per day. So you can get away with using VMs that have lots of CPU and memory.
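A rough sketch of that resume / build / suspend flow, driven from a CI step (instance name, zone, repo path, and build command are placeholders):

```python
# Hypothetical wrapper around gcloud: resume a pre-provisioned suspended VM,
# run the build over SSH, then suspend the VM again so it stops costing money.
import subprocess

INSTANCE = "build-box"        # assumption: an existing, suspended VM
ZONE = "europe-west1-b"       # assumption: its zone

def gcloud(*args: str) -> None:
    subprocess.run(["gcloud", "compute", *args, "--zone", ZONE], check=True)

def run_build(branch: str) -> None:
    gcloud("instances", "resume", INSTANCE)       # roughly the ~30s startup cost
    try:
        gcloud("ssh", INSTANCE, "--command",
               f"cd repo && git fetch && git checkout {branch} && ./build.sh")
    finally:
        gcloud("instances", "suspend", INSTANCE)  # stop paying for idle time
```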
bpsh about 1 year ago
Interesting, and it makes a lot of sense to me as far as pricing too. However, I feel the video demonstration could be greatly improved in terms of explanation and enthusiasm. It's super cool, though, and presentations/demos should showcase the full potential!
math0ne about 1 year ago
I used this to set up my runners on a dedicated server: https://github.com/vbem/multi-runners
nodesocket about 1 year ago
A cool idea, but I'm not sure about the business case. I wrote a quick and dirty bash script which automates the process of adding 2x GitHub runners on instances (2 CPU cores and 4 GB memory each); simply scale out horizontally. Since the instances are persistent, you get Docker image caching out of the box, unlike hosted runners on GitHub. Also, arm64 is fully supported.
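For illustration, a sketch of roughly what such a provisioning script might do with the standard self-hosted runner setup commands (runner release URL, repo URL, and registration token are placeholders):

```python
# Hypothetical provisioning of N self-hosted runners on one persistent host.
import pathlib
import subprocess

RUNNER_TARBALL = "https://github.com/actions/runner/releases/download/v2.316.0/actions-runner-linux-x64-2.316.0.tar.gz"  # placeholder version
REPO_URL = "https://github.com/example-org/example-repo"   # placeholder repo
REG_TOKEN = "REPLACE_WITH_REGISTRATION_TOKEN"              # from repo/org settings

def add_runner(index: int) -> None:
    home = pathlib.Path(f"/opt/actions-runner-{index}")
    home.mkdir(parents=True, exist_ok=True)
    # Download and unpack the runner, register it, then run it as a service.
    subprocess.run(f"curl -sL {RUNNER_TARBALL} | tar -xz -C {home}",
                   shell=True, check=True)
    subprocess.run(["./config.sh", "--url", REPO_URL, "--token", REG_TOKEN,
                    "--name", f"runner-{index}", "--unattended"],
                   check=True, cwd=home)
    subprocess.run(["./svc.sh", "install"], check=True, cwd=home)
    subprocess.run(["./svc.sh", "start"], check=True, cwd=home)

for i in range(2):   # two runners per instance, as in the comment above
    add_runner(i)
```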
brycelarkin about 1 year ago
For the AWS CDK folks, I've been very happy with this library: https://github.com/CloudSnorkel/cdk-github-runners. Love that I can use spot pricing and the c7g instances for CI/CD.
timvdalen about 1 year ago
Congrats on shipping! We built something similar internally. Tweaking it for the right cost/availability/speed trade-off was interesting, but we now have it working to where workers are generally spun up from zero faster than GitHub's own are.
LilBytes about 1 year ago
Hey Jacob, awesome suggestion!

Are you building your base image from the GitHub runner-images repo?

Do you have any appetite for building self-hosted EC2 agents for Azure DevOps and GitHub?

I'm happy to help if you are; I'm working on something similar myself for my employer.
siborg about 1 year ago
Your website is surprisingly good. Often, Show HN sites are pretty basic and a little off the mark, but this was clear. Pricing seems simple too. Great job. Will give it a try.
alas44 about 1 year ago
How do you ensure privacy/isolation between users if you have a pool of ready VMs that you re-use?
playingalong about 1 year ago
Can I use my own AWS account?
YouWhy about 1 year ago
TL;DR: managed runners, by construction, constitute a major ongoing infosec liability.

A managed runner means not only entrusting a third party with your code, but also typically providing it with enough data/network connectivity to make testing/validation feasible as part of the build process. While this is doable per se, it introduces multiple major failure modes outside of the data owner's control.

Failure scenario (hypothetical): you hydrate your test DB using live data; you store it in a dedicated secure S3 bucket, which you make accessible to the build process. Now the managed runner organization gets hacked, because building resilient infra is hard, and the attackers intercept the S3 credentials used by your build process. Boom! Your live data is now at the mercy of the attackers.
pestkranker about 1 year ago
How does it compare to BuildJet?