
Launch HN: Parity (YC S24) – AI for on-call engineers working with Kubernetes

83 points | by wilson090 | 9 months ago
Hey HN — we're Jeffrey, Coleman, and Wilson, and we're building Parity (https://tryparity.com), an AI SRE copilot for on-call engineers working with Kubernetes. Before you've opened your laptop, Parity has conducted an investigation to triage, determine root cause, and suggest a remediation for an issue. You can check out a quick demo of Parity here: https://tryparity.com/demo

We met working together as engineers at Crusoe, a cloud provider, and we always dreaded being on-call. It meant a week of putting our lives and projects on hold to be prepared to firefight an issue at any hour of the day. We experienced sleepless nights after being woken up by a PagerDuty alert, only to then have to find and follow a runbook. We canceled plans to make time to sift through dashboards and logs in search of the root cause of downtime in our k8s cluster.

After speaking with other devs and SREs, we realized we weren't alone. While every team wants better monitoring systems or a more resilient design, the reality is that time and resources are often too limited to make these investments.

We're building Parity to solve this problem. We're enabling engineers working with Kubernetes to handle their on-call more easily by using AI agents to execute runbooks and conduct root cause analysis. We knew LLMs could help, given their ability to quickly process and interpret large amounts of data. But we've found that LLMs alone aren't sufficiently capable, so we've built agents to take on more complex tasks like root cause analysis.
By allowing on-call engineers to handle these tasks more easily, and eventually freeing them from such responsibilities, we create more time for them to focus on complex and valuable engineering investments.

We built an agent that investigates issues in Kubernetes by following the same steps a human would: developing a possible root cause, validating it against logs and metrics, and iterating until a well-supported root cause is found. Given a symptom like "we're seeing elevated 503 errors", our agent develops hypotheses as to why this may be the case, such as nginx being misconfigured or application pods being under-resourced. Then it gathers the necessary information from the cluster to either support or rule out those hypotheses. The results are presented to the engineer as a report with a summary and each hypothesis. It includes all the evidence the agent considered in coming to a conclusion, so that an engineer can quickly review and validate the results. With the results of the investigation in hand, an on-call engineer can focus on implementing a fix.

We've built an additional agent to automatically execute runbooks when an alert is triggered. It follows the steps of a runbook more rigorously than an LLM alone, and with more flexibility than workflow-automation tools like Temporal. This agent is a combination of separate LLM agents, each responsible for a single step of the runbook. Each runbook step agent will execute arbitrary instructions like "look for nginx logs that could explain the 503 error". A separate LLM evaluates the results, ensuring the step agent followed the instructions, and determines which subsequent step of the runbook to execute. This allows us to execute runbooks with cycles, retries, and complex branching conditions.

With these tools, we aim to handle the "what's going wrong" part of on-call for engineers.
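The step-agent-plus-evaluator loop described above can be sketched roughly as follows. This is our illustrative reading of the design, not Parity's actual implementation: the function names, the dict-based runbook format, and the stubbed-out LLM calls are all assumptions for the sake of the example.

```python
from typing import Optional

def run_step(instruction: str, cluster_state: dict) -> str:
    """Step agent: carries out one runbook instruction.
    (Stand-in for an LLM agent issuing read-only cluster queries.)"""
    return f"observed: {cluster_state.get(instruction, 'nothing relevant')}"

def evaluate(step_id: str, result: str, runbook: dict) -> Optional[str]:
    """Evaluator: checks the step's result and picks the next step,
    which is what enables cycles, retries, and branching."""
    step = runbook[step_id]
    return step["on_match"] if step["expect"] in result else step["on_miss"]

def execute_runbook(runbook: dict, start: str, cluster_state: dict,
                    max_steps: int = 20) -> list:
    trace, step_id = [], start
    for _ in range(max_steps):      # bound the loop: runbooks may contain cycles
        if step_id is None:         # evaluator decided the runbook is done
            break
        result = run_step(runbook[step_id]["instruction"], cluster_state)
        trace.append((step_id, result))
        step_id = evaluate(step_id, result, runbook)
    return trace

# Hypothetical two-step runbook for the "elevated 503s" example:
runbook = {
    "check_logs": {"instruction": "nginx_logs", "expect": "503",
                   "on_match": "check_upstream", "on_miss": None},
    "check_upstream": {"instruction": "pod_status", "expect": "OOMKilled",
                       "on_match": None, "on_miss": None},
}
cluster = {"nginx_logs": "503 upstream timeout", "pod_status": "OOMKilled"}
trace = execute_runbook(runbook, "check_logs", cluster)
```

In a real system the `run_step` and `evaluate` stubs would each be an LLM call, and the trace would become the evidence section of the report the post describes.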
We still believe it makes the most sense to continue to trust engineers with actually resolving issues, as this requires potentially dangerous or irreversible commands. For that reason, our agents exclusively execute read-only commands.

If this sounds like it could be useful for you, we'd love for you to give the product a try! Our service can be installed in your cluster via a Helm repo in just a couple of minutes. For our HN launch, we've removed the billing requirement for new accounts, so you can test it out on your cluster for free.

We'd love to hear your feedback in the comments!
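The post doesn't say how the read-only guarantee is enforced, but one plausible client-side layer (purely our assumption) is an allow-list on kubectl verbs before any command reaches the cluster:

```python
# Hypothetical guard that only lets read-only kubectl verbs through.
# The verb list and shlex-based parsing are illustrative choices on our
# part, not a description of Parity's actual enforcement mechanism.
import shlex

READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain", "api-resources"}

def is_read_only(command: str) -> bool:
    """Return True only for kubectl commands whose verb is allow-listed."""
    parts = shlex.split(command)
    return len(parts) >= 2 and parts[0] == "kubectl" and parts[1] in READ_ONLY_VERBS

assert is_read_only("kubectl get pods -n ingress")
assert not is_read_only("kubectl delete pod nginx-0")
```

In practice a cluster-side control, such as a Kubernetes RBAC role limited to the `get`, `list`, and `watch` verbs, is the stronger enforcement point, since it cannot be bypassed by a malformed command string.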

14 comments

Atotalnoob, 9 months ago
It would be kind of interesting if, based on an engineer accepting the suggestion, Parity generated a new runbook. This would allow repeated issues to be well documented.

On iOS Firefox, when clicking "pricing" on the menu, it scrolls to the proper location but does not close the menu. Closing the menu causes it to jump to the top of the page. Super annoying.
stackskipton, 9 months ago
Azure Kubernetes wrangler (SRE) here. Before I turn some LLM loose on my cluster, I need to know what it supports, how it supports it, and how I can integrate it into my workflow.

The videos show a CrashLoopBackOff pod and analyzing its logs. That works if the pod is writing to stdout, but I've got some stuff going straight to Elasticsearch. Does the LLM speak Elasticsearch? How about log files inside the pod? (Don't get me started on that nightmare.)

You also show fixing issues by editing YAML in place. That's great, except my FluxCD is going to revert it, since you violated the principle of "everything goes through GitOps". So if you're going to change anything, you need to update the proper Git repo. Said GitOps setup also uses Kustomize, so I hope you understand all the interactions there.

Personally, the stuff that takes the most troubleshooting time is Kubernetes infrastructure: the network CNI is acting up, the ingress controller is missing proper path-based routing, a NetworkPolicy says no to a pod talking to the Postgres server, cert-manager is on strike and a certificate has expired. If the LLM is quick at identifying those, it has some uses, but selling me on "dev made a mistake with pod config" is unlikely to move the needle, because I'm already really quick at identifying that.

Maybe I'm not the target market, and the target market is "small dev team that bought Kubernetes without realizing what they were signing up for".
henning, 9 months ago
An AI agent to triage the production issues caused by code generated by some other startup's generative AI bot. I fucking love tech in 2024.
habosa, 9 months ago
Just some feedback on the landing page: ditch the Stanford/MIT/Carnegie Mellon logos. I'm not hating on elite universities or anything, but they have no relevance here (this is not a research project) and I think they detract from the brand. I don't associate academia with pager-carrying operators of critical services.
ronald_petty, 9 months ago
I think this kind of tooling is one positive aspect of integrating LLM tech into certain workflows/pipelines. Tools like k8sgpt are similar in purpose and show strong potential to be useful. I look forward to seeing how this progresses.
raunakchowdhuri, 9 months ago
Hmm, I don't know how I'd feel about giving an LLM cluster access from a security point of view.
RadiozRadioz, 9 months ago
> using AI agents to execute runbooks

This scares me. If I were confident enough in the runbook steps, they'd already be automated by a program. If it's a runbook and not a program, either it's really new or there's some subtle nuance around it. "AI" is cool, and humans aren't perfect, but in this scenario I'd still prefer the judgment of a skilled operator who knows the business.

> our agents exclusively execute read-only commands

How is this enforced?

The RCA is the better feature of this tool, in my opinion.
drawnwren, 9 months ago
This is a great idea. I use Claude for most of my unknown K8s bugs, and it's impressive how useful it is (far more than for my coding bugs).
nerdjon, 9 months ago
Well, the website seems to be down, so I can't actually see any information about which LLM you are using, but I seriously hope you are not just sending the data to the OpenAI API or something like that, and are instead forcing the use of a private (ideally self-hosted) service. I would not want any data about my infrastructure sent to a public LLM, regardless of how sanitized things are.

Otherwise, on paper it seems cool. But I worry about getting complacent with this tech. It is going to fail; that is just the reality. We know LLMs will hallucinate, and there is not much we can do about it; it is the nature of the tech.

So it might work most of the time, but when it doesn't, you're bashing your head against the wall trying to figure out what's broken. The system is telling you that all of these things are fine, but one of them actually isn't. Yet it worked enough times that you trust it, so you don't bother double-checking.

That is before we even talk about having this thing run code for automatic remediation, which I hope no one ever seriously considers doing.
mdaniel, 9 months ago
Why would you have your demo video set to "unlisted"? (On what appears to be your official channel.) I'd think you'd want to show up in as many places as possible.
manveerc, 9 months ago
Congratulations on the launch! I'm curious: how is what you're building different from other AI SRE solutions out there, like Cleric, Onegrep, Resolve, Beeps, and others?
klinquist, 9 months ago
Website won't load - just me?
andrewguy9, 9 months ago
For God's sake, SREs need to give up on K8s. It was a bad idea; just move on.

The answer is not "let an AI figure it out."

That is legitimately scary.
threeseed, 9 months ago
> This agent is a combination of separate LLM agents each responsible for a single step of the runbook

Someone needs to explain to me how this is expected to work.

Percentage of hallucinations/errors x steps in runbook = total errors

0.05 x 10 = 0.5 = 50%
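For what it's worth, the linear multiplication above is an upper-bound approximation. Assuming each step fails independently with probability p, the chance that at least one of n steps goes wrong is 1 - (1 - p)^n, which a few lines of Python can check:

```python
# Probability that at least one step of an n-step runbook goes wrong,
# assuming each step fails independently with probability p.
def compound_error(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# For p = 0.05 and n = 10: about 0.401, vs. the 0.5 linear estimate.
print(compound_error(0.05, 10))
```

Either way the commenter's point stands: per-step error rates compound quickly, so a runbook of any length needs per-step validation (as the post's evaluator LLM attempts) or a much lower per-step error rate.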