Hey HN, half a year ago we were building a feature where our users could trigger long-running jobs with the following requirements:
(i) the job is started by a user action (e.g., clicking a button in the UI),
(ii) the job can be canceled by the user,
(iii) the progress of the job can be inspected while the job is running,
(iv) hardware for the job is provisioned only when needed (the job may require expensive hardware and runs infrequently).

Because we couldn't find an existing way to achieve this, we decided to implement one for Python. Currently, it uses an AWS ECS cluster (friendly Terraform to provision it is attached!), but it should be easy to port to K8s so that it works on other clouds as well.
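
To make requirements (i) to (iii) concrete, here is a minimal in-process sketch of the job lifecycle we had in mind. It is not the project's actual API: the `Job` class and its `start`, `cancel`, `progress`, and `wait` methods are hypothetical names, and a background thread stands in for the remotely provisioned hardware of requirement (iv).

```python
import threading
import time
import uuid


class Job:
    """Hypothetical in-process stand-in for a remotely provisioned job."""

    def __init__(self, steps):
        if steps <= 0:
            raise ValueError("steps must be positive")
        self.id = uuid.uuid4().hex
        self.steps = steps
        self.done_steps = 0
        self.status = "pending"
        self._cancel = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        # (i) Triggered by a user action. A real system would provision
        # hardware here (iv); this sketch only starts a local thread.
        self.status = "running"
        self._thread.start()

    def _run(self):
        for _ in range(self.steps):
            # (ii) Check for a cancellation request between units of work.
            if self._cancel.is_set():
                self.status = "canceled"
                return
            time.sleep(0.01)  # placeholder for one unit of real work
            self.done_steps += 1
        self.status = "finished"

    def cancel(self):
        # Request cancellation; the worker stops at its next check.
        self._cancel.set()

    def progress(self):
        # (iii) Fraction of completed work, readable while the job runs.
        return self.done_steps / self.steps

    def wait(self, timeout=None):
        # Block until the worker thread exits or the timeout elapses.
        self._thread.join(timeout)
```

In this sketch cancellation is cooperative: `cancel()` only sets a flag, and the worker stops the next time it checks the flag between units of work, so the job code has to be written with such checkpoints in mind.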