Show HN: I built an open-source tool to make on-call suck less

319 points | by aray07 | 10 months ago
Hey HN,<p>I am building an open source platform to make on-call better and less stressful for engineers. We are building a tool that can silence alerts and help with debugging and root cause analysis. We also want to automate tedious parts of being on-call (running runbooks manually, answering questions on Slack, dealing with Pagerduty). Here is a quick video of how it works: <a href="https:&#x2F;&#x2F;youtu.be&#x2F;m_K9Dq1kZDw" rel="nofollow">https:&#x2F;&#x2F;youtu.be&#x2F;m_K9Dq1kZDw</a><p>I hated being on-call for a couple of reasons:<p>* Alert volume: The number of alerts kept increasing over time. It was hard to maintain existing alerts. This would lead to a lot of noisy and unactionable alerts. I have lost count of the number of times I got woken up by alert that auto-resolved 5 minutes later.<p>* Debugging: Debugging an alert or a customer support ticket would need me to gain context on a service that I might not have worked on before. These companies used many observability tools that would make debugging challenging. There are always a time pressure to resolve issues quickly.<p>There were some more tangential issues that used to take up a lot of on-call time<p>* Support: Answering questions from other teams. A lot of times these questions were repetitive and have been answered before.<p>* Dealing with PagerDuty: These tools are hard to use. e.g. It was hard to schedule an override in PD or do holiday schedules.<p>I am building an on-call tool that is Slack-native since that has become the de-facto tool for on-call engineers.<p>We heard from a lot of engineers that maintaining good alert hygiene is a challenge.<p>To start off, Opslane integrates with Datadog and can classify alerts as actionable or noisy.<p>We analyze your alert history across various signals:<p>1. Alert frequency<p>2. How quickly the alerts have resolved in the past<p>3. Alert priority<p>4. Alert response history<p>Our classification is conservative and it can be tuned as teams get more confidence in the predictions. We want to make sure that you aren&#x27;t accidentally missing a critical alert.<p>Additionally, we generate a weekly report based on all your alerts to give you a picture of your overall alert hygiene.<p>What’s next?<p>1. Building more integrations (Prometheus, Splunk, Sentry, PagerDuty) to continue making on-call quality of life better<p>2. Help make debugging and root cause analysis easier.<p>3. Runbook automation<p>We’re still pretty early in development and we want to make on-call quality of life better. Any feedback would be much appreciated!

36 comments

dclowd9901, 10 months ago
> It reduces alert fatigue by classifying alerts as actionable or noisy and providing contextual information for handling alerts.

*grimace face*

I might be missing context here, but this kind of problem speaks more to a company's inability to create useful observability, or worse, their lack of conviction around solving noisy alerts (which upon investigation might not even be "just" noise)! Your product is welcome and we can certainly use more competition in this space, but this aspect of it is basically enabling bad cultural practices and I wouldn't highlight it as a main selling point.
jedberg, 10 months ago
People do not understand the value of classifying alerts as useful *after the fact*.

At Netflix we built a feature into our alert systems that added a simple button at the top of every alert that said, "Was this alert useful?". Then we would send the alert owners reports about what percent of people found their alert useful.

It really let us narrow in on which alerts were most useful, so that others could subscribe to them, and which were noise, so they could be tuned or shut off.

That one button alone made a huge difference in people's happiness with being on call.
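For illustration, a minimal sketch of the feedback loop described here: record one "Was this alert useful?" vote per firing and report the useful percentage back to each alert's owner. The `Vote` structure and report format are assumptions, not Netflix's actual system.

```python
# Illustrative sketch: aggregate per-alert usefulness votes into owner reports.
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Vote:
    alert_name: str
    owner: str
    useful: bool

def weekly_report(votes: List[Vote]) -> Dict[str, Dict[str, str]]:
    """Group votes by owner and alert, returning a 'useful %' line per alert."""
    tally: Dict[str, Dict[str, List[bool]]] = defaultdict(lambda: defaultdict(list))
    for v in votes:
        tally[v.owner][v.alert_name].append(v.useful)

    report: Dict[str, Dict[str, str]] = {}
    for owner, alerts in tally.items():
        report[owner] = {
            name: f"{100 * sum(results) / len(results):.0f}% useful ({len(results)} votes)"
            for name, results in alerts.items()
        }
    return report

votes = [
    Vote("db-replica-lag", "data-team", True),
    Vote("db-replica-lag", "data-team", False),
    Vote("disk-90-percent", "platform", False),
]
print(weekly_report(votes))
```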
aflag, 10 months ago
It feels to me that using an LLM to classify alerts as noisy just adds risk instead of fixing the root cause of the problem. If an alert is known to be noisy and has appeared on Slack before (which is how the LLM would figure out it's a noisy alert), then just remove the alert? Otherwise, how will the LLM know it's noise? Either it will correctly annoy you or hallucinate a reason to decide that alert is just noise.
ravedave5, 10 months ago
The goal for oncall should be to NEVER get called. If someone gets called when they are oncall their #1 task the next day is to make sure that call never happens again. That means either fixing a false alarm or tracking down the root cause of the call. Eventually you get to a state where being called is by far the exception instead of the norm.
Jolter, 10 months ago
Telecoms solved this problem fifteen years ago when they started automating Fault Management (google it).

Granted, neural networks were not generally applicable to this problem at the time, but this whole idea seems like the same problem being solved again.

Telecoms and IT used to supervise their networks using Alarms, in either a Network Management System (NMS) or something more ad-hoc like Nagios. There, you got structured alarms over a network, like SNMP traps, that got stored as records in a database. It's fairly easy to program filters using simple counting or more complex heuristics against a database.

Now, for some reason, alerting has shifted to Slack. Naturally, since the data is now unstructured text, the solution involves an LLM! You build complexity into the filtering solution because you have an alarm infrastructure that's too simple.
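For illustration, a minimal sketch of the kind of simple counting filter described here, run against a hypothetical structured alarms table. The schema, SQLite backend, and thresholds are assumptions for the example.

```python
# Illustrative sketch: a "flapping alarm" filter via simple counting in SQL.
import sqlite3

# Hypothetical structured alarm store, as an NMS would keep (schema assumed).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE alarms (
        id INTEGER PRIMARY KEY,
        source TEXT,        -- device or service that raised the alarm
        alarm_type TEXT,    -- e.g. SNMP trap OID or alert name
        raised_at TEXT,     -- ISO-8601 timestamp
        cleared_at TEXT     -- NULL while the alarm is still active
    )
""")

def is_flapping(source: str, alarm_type: str,
                window_hours: int = 24, threshold: int = 5) -> bool:
    """Suppress paging for alarms that keep raising and clearing themselves
    within the recent window; everything else pages as usual."""
    (count,) = conn.execute(
        """
        SELECT COUNT(*) FROM alarms
        WHERE source = ? AND alarm_type = ?
          AND cleared_at IS NOT NULL
          AND raised_at >= datetime('now', ?)
        """,
        (source, alarm_type, f"-{window_hours} hours"),
    ).fetchone()
    return count >= threshold
```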
mads_quist, 10 months ago
Founder of All Quiet here: https://allquiet.app.

We're building a tool in the same space but opted out of using LLMs. We've received a lot of positive feedback from our users who explicitly didn't want critical alerts to be dependent on a possibly opaque LLM. While I understand that some teams might choose to go this route, I agree with some commentators here that AI can help with symptoms but doesn't address the root cause, which is often poor observability and processes.
RadiozRadioz, 10 months ago
> Slack-native since that has become the de-facto tool for on-call engineers.

In your particular organization. Slack is one of many instant messaging platforms. Tightly coupling your tool to Slack instead of making it platform agnostic immediately restricts where it can be used.

Other comment threads are already discussing the broader issues with using IM for this job, so I won't go into it here.

Regardless, well done for making something.
throw156754228, 10 months ago
I don't want to be relying on another flaky LLM for anything mission critical like this.

Just fix the original problem, don't layer an LLM into it.
Terretta, 10 months ago
Note that according to Stack Overflow's dev survey, more devs use Teams than Slack; over 50% were on Teams. (The stat was called popularity but really should have been prevalence, since a related stat showed devs hated Teams even more than they hated Slack.) Teams has APIs too, and with Microsoft Graph working you can do a lot more than just Teams for them.

More importantly, and not mentioned by Stack Overflow, those devs are among the 85% of businesses using M365, meaning they have "Sign in with Microsoft" and are on teams that will pay. The rest have Google and/or GitHub.

This means despite being a high-value hacking target (accounts and passwords of people who operate infrastructure, like the person owned from Snowflake last quarter), you don't have to store passwords and therefore can't end up on Have I Been Pwned.
voidUpdate, 10 months ago
Filtering whether a notification is important or not through an LLM, when getting it wrong could cause big issues, is mildly concerning to me...
nprateem, 10 months ago
Almost all alerting issues can be fixed by putting managers on call too (who then have to attend the fix too).

It suddenly becomes a much higher priority to get alerting in order.
asdf6969, 10 months ago
I don't really understand the use case. If there's a way to programmatically tell that it's a false alarm, then there must also be a way to not create the alert in the first place.

I've never seen an issue that's conclusively a false alarm without investigating at all. Just delete the alarm? An LLM will never find something like another team accidentally stress testing my service, but that does happen.

Another perfect example is when the queen died and it looked like an outage for UK users. Can your LLM read the news? ChatGPT doesn't even know if she's alive.

I expect you will need AGI before large companies will trust your product.
makmanalp, 10 months ago
An underrated on-call problem that needs solving is scheduling, IMHO:

- We have a weekday (2 shifts) / weekend (1 slightly longer shift including Friday morning, to allow people to take long weekends) on-call rotation, as well as a group-combined on-call schedule, which gets finicky.

- When people join or leave the rotation, making sure nothing shifts before a certain date, or swapping one person with another without changing the rest, and other things are a massive pain in the butt.

- Combine this with a company holiday list; usually there's different policies and expectations during those.

- Allow custom shift change times for people in different timezones.

- We have "on-call training" / shadowing for newbies; automate the process of substituting them in gradually, first with a shared daytime rotation and then on their own, etc.

- Make on-call trades (if you can't make your shift) simpler.

Gripes with PD:

- PagerDuty keeps insisting I'm "always on call" because I'm on level N of a fallback pager chain, which makes their "when on call next" box useless; just let me pick.

- Similarly, PagerDuty's Google Calendar export will just jam in every service you're remotely related to and won't let you pick when exporting, even though it will in their UI. So I can't just have my on-call schedule in Google Calendar without polluting it to all hell.
lmeyerov, 10 months ago
Big fan of this direction. The architecture resonates! The baselining is interesting; I'm curious how you think about that, esp. for bootstrapping initially + ongoing.

We are working on a variant being used more by investigative teams than IT ops - so think IR, fraud, misinfo, etc. - which has similarities but also domain differences. If of interest to someone with an operational infosec background (hunt, IR, secops), and esp. US-based, the Louie.AI team is hiring an SE + principal here.
CableNinja, 10 months ago
I get your sentiment, but there's another side of this coin that everyone is forgetting, hilariously.

*You can tune your monitoring!*

Noisy alert that tends to be a false positive, but not always? Tune the alert to only send if the issue continues for more than a minute, or if the check fails 3 times in a row. There are hundreds of ways to tweak a monitor to match your environment.

Best of all? It takes 30 seconds at most. Find the trigger, adjust slightly, and after maybe 1-2 tries, you'll be getting 1 false positive sometimes, and actual alerts when they happen, compared to 99% false alerts, all the time.

Oh, and did you know any monitoring solution worth its salt can execute things automatically on alerts, and then can alert you if that thing fails?

Also, Slack is not a de facto anything. It's a chat tool in a world of chat tools.
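For illustration, a minimal sketch of that tuning in code: only page when a check has failed several times in a row or has stayed failing for a minimum duration. The `Check` class and defaults are assumptions, not any particular monitoring product's API.

```python
# Illustrative sketch: debounce a failing check so one-off blips never page.
import time
from typing import Optional

class Check:
    def __init__(self, failures_needed: int = 3, min_failing_seconds: float = 60.0):
        self.failures_needed = failures_needed
        self.min_failing_seconds = min_failing_seconds
        self.failure_count = 0
        self.first_failure_at: Optional[float] = None

    def record(self, healthy: bool) -> bool:
        """Feed one check result; return True only when it is time to page."""
        if healthy:
            # Recovery resets the streak, so a single blip never pages anyone.
            self.failure_count = 0
            self.first_failure_at = None
            return False

        self.failure_count += 1
        if self.first_failure_at is None:
            self.first_failure_at = time.monotonic()
        failing_for = time.monotonic() - self.first_failure_at

        # Page on sustained failure: N consecutive misses or a minimum duration.
        return (self.failure_count >= self.failures_needed
                or failing_for >= self.min_failing_seconds)
```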
jpb0104, 10 months ago
I love this space; stability & response! After my last full-time gig, I was also frustrated with the available tooling and ONLY wanted an on-call scheduling tool with simple calendar integration. So I built: https://majorpager.com/ Not OSS, but very simple and hopefully pretty straightforward to use. I'm certainly wide open to feedback.
solatic, 10 months ago
In my current workplace (BigCo), we know exactly what's wrong with our alert system. We get alerts that we can't shut off, because they (legitimately) represent customer downtime, and whose root cause we either can't identify (lack of observability infrastructure) or can't fix (the fix is non-trivial and management won't prioritize).

Running on-call well is a culture problem. You need management to prioritize observability (you can't fix what you can't show as being broken), then you need management to build a no-broken-windows culture (feature development stops if anything is broken).

Technical tools cannot fix culture problems!

edit: management not talking to engineers, or being aware of problems and deciding not to prioritize fixing them, are both culture problems. The way you fix culture problems, as someone who is not in management, is to either turn your brain off and accept that life is imperfect (i.e. fix yourself instead of the root cause), or to find a different job (i.e. if the culture problem is so bad that it's leading to burnout). In any event, cultural problems *cannot* be solved with technical tools.
maximinus_thrax, 10 months ago
Nice work, I always appreciate contributions to the OSS ecosystem.

That said, I like that you're 'saying it out loud' with this. Slack and other similar comm tooling has always been advertised as a productivity booster due to its 'async' nature. Nobody actually believes this anymore, and coupling it with on-call notifications really closes the lid on that thing.
topaztee, 10 months ago
Co-founder of merlinn here: https://merlinn.co | https://github.com/merlinn-co/merlinn

We're also building a tool in the same space, with the option of choosing your own model (private LLMs) + we're open source with a multitude of integrations.

Good to see more options in this space! Especially OSS. I think de-noising is a good feature, given alert fatigue is one of the repeating complaints of on-callers.
deepfriedbits, 10 months ago
Nice job and congratulations on building this! It looks like your copy is missing a word in the first paragraph:

> Opslane is a tool that helps (make) the on-call experience less stressful.
Arch-TK, 10 months ago
We could stop normalising "on-call" instead.
snihalani, 10 months ago
can you build a cheaper datadog instead?
tryauuum, 10 months ago
Every time I see notifications in Slack / Telegram it makes me depressed. Text messengers were not designed for this. If you get a "something is wrong" alert, it becomes part of the history; it won't re-alert you if the problem is still present. And if you have more than one type of alert, it will be lost in the history.

I guess alerts to messengers are OK as long as it's only a couple of manually created ones, and there should be a graphical dashboard to learn about the rest of the problems.
T1tt, 10 months ago
is this only on the frontpage because this is an HN company?
c0mbonat0r, 10 months ago
If this is an open-source project, how are you planning to make it a sustainable business? Also, why the choice of Apache 2.0?
T1tt, 10 months ago
How can you prove it works and doesn't hallucinate? Do you have any actual users who have installed it and found it useful?
lars_francke, 10 months ago
Shameless question, tangentially related to the topic.

We are based in Europe and have the problem that some of us sometimes just forget we're on call, or are afraid that we'll miss OpsGenie notifications.

We're desperately looking for a hardware solution. I'd like something similar to the pagers of the past, but at least here in Germany they don't really seem to exist anymore. Ideally I'd have a Bluetooth dongle that alerts me on configurable notifications on my phone. Carrying this dongle for the week would be a physical reminder that I'm on call.

Does anyone know anything?
7bit, 10 months ago
> * Alert volume: The number of alerts kept increasing over time. It was hard to maintain existing alerts. This would lead to a lot of noisy and unactionable alerts. I have lost count of the number of times I got woken up by an alert that auto-resolved 5 minutes later.

I don't understand this. Either the issue is important and requires immediate human action -- or the issue can potentially resolve itself and should only ever send an alert if it doesn't after a set grace period.

The way you're trying to resolve this (with increasing alert volumes) is the worst approach to both of the above, and improves nothing.
protocolture, 10 months ago
I feel like this would be a great tool for people who have had a much better experience of on-call than I have had.

I once worked for a string of businesses that would just send *everything* to on-call unless engineers threatened to quit. Promised automated late-night customer sign-ups? Haven't actually invested in the website so that it can do that? Just make the on-call engineer do it. Too lazy to hire offshore L1 technical support? Just send residential internet support calls to the on-call engineer! Sell a service that doesn't work in the rain? Just send the on-call guy to site every time it rains so he can reconfirm that yes, the service sucks. Basic usability questions that could have been resolved during business hours? Does your contract say 24/7 support? Damn, guess that's going to on-call.

Shit, even in contracting gigs where I have agreed to be "on call" for severity 1 emergencies, small business owners will send you things like service turn-ups or slow-speed issues.
EGreg, 10 months ago
One of the "no-bullshit" positions I have arrived at over the years is that "real-time is a gimmick".

You don't need that Times Square ad; only 8-10 people will look up. If you just want the footage of your conspicuous consumption, you've been able to photoshop it for decades already.

Similarly, chat causes anxiety and lack of productivity. Threaded forums like HN are better. Having a system to prevent problems and the rare emergency is better than having everyone glued to their phones 24/7. And frankly, threads keep information better localized AND give people a chance to THINK about the response and iterate before posting in a hurry. When producers of content take their time, this creates efficiencies for EVERY INTERACTION WITH that content later, and effects downstream. (eg my caps lock gaffe above, I wont go back and fix it, will jjst keesp typing 111!1!!!)

Anyway people, so now we come to today's culture. Growing up, I had people call and wish me a happy birthday. Then they posted it on FB. Then FB automated the wishes so you just press a button. Then people automated the thanks by pressing likes. And you can probably make a bot to automate that. What once was a thoughtful gesture has become commoditized, with bots talking to bots.

Similar things occurred with resumes and job applications, etc.

So I say, you want to know my feedback? Add an AI agent that replies back with basic assurances and questions to whoever "summoned you", have the AI fill out a form, and send you that. The equivalent of front-line call center workers asking "Have you tried turning it on and off again?" and "I understand it doesn't work, but how can we replicate it?"

That repetitive stuff should be done by AI, building up an FAQ knowledge base for bozos, and then it should only bother you if it comes across a novel problem it hasn't solved yet, like an emergency because, say, there's a Windows BSOD spreading and systems don't boot up. Make the AI do triage and tell the difference.
LunarFrost88, 10 months ago
Really cool!
racka, 10 months ago
Really cool!

Anyone know of a similar alert UI for data/business alarms (e.g. installs dropping WoW, crashes spiking DoD, etc.)?

Something that feeds off Snowflake/BigQuery, but with a similar nice UI so that you can quickly see false positives and silence them.

The tools I've used so far (mostly in-house built) have all ended in a spammy Slack channel that no one ever checks anymore.
Flop7331, 10 months ago
Is this for missile defense systems or something? What's possibly so important that you need to be woken up for it?
theodpHN, 10 months ago
What you've come up with looks helpful (and may have other applications, as someone else noted), but you know what also makes on-call suck less? Getting paid for it, in $ and/or generous comp time. :-)

https://betterstack.com/community/guides/incident-management/on-call-pay/

Also helpful is having management that is responsive to bad on-call situations and recognizes when capable, full-time, around-the-clock staffing is really needed. It seems too few well-paid tech VPs understand what a 7-Eleven management trainee does, i.e., you shouldn't rely on 1st-shift workers to handle all the problems that pop up on 2nd and 3rd shift!
throwaway984393, 10 months ago
Don't send an alert at all unless it is actionable. Yes, I get it, you want alerts for everything. Do you have a runbook that can explain to a complete novice what is going on and how to fix the problem? No? Then don't alert on it.

The only way to make on-call less stressful is to do the boring work of preparing for incidents, and the boring work of cleaning up after incidents. No magic software will do it for you.
sanj001, 10 months ago
Using LLMs to classify noisy alerts is a really clever approach to tackling alert fatigue! Are you fine-tuning your own model to differentiate between actionable and noisy alerts?

I'm also working on an open source incident management platform called Incidental (https://github.com/incidentalhq/incidental), slightly orthogonal to what you're doing, and it's great to see others addressing these on-call challenges.

Our tech stacks are quite similar too - I'm also using Python 3 and FastAPI!