科技回声

8 comments

slap_shot, over 1 year ago

I'm surprised how often I speak to technical teams that do not use PagerDuty (or an equivalent). Because PagerDuty integrates with nearly any external system, it separates the collection of telemetry from the incident response lifecycle, i.e. what is wrong? who should be, or is, looking into this? what did we learn from this? how often is this happening?

Personally, I find notifications in Slack to be an anti-pattern: a lot of teams expect someone to just "pick up" the incident based on their availability or expertise, and _maybe_ the resolution is documented. Assigning direct responsibility by component and on-call schedule, and appending the RCA, reduces the time-to-resolution and the overall toil of the process.
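To make "direct responsibility by component and on-call schedule" concrete, here is a minimal sketch of the lookup such a tool performs. It is not PagerDuty's API; the `ROTATIONS` table, the engineer names, and the weekly rotation rule are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical rotations: component name -> engineers in rotation order.
ROTATIONS = {
    "api": ["alice", "bob"],
    "database": ["carol", "dave", "erin"],
}


def on_call_for(component: str, day: date) -> str:
    """Return whoever is on call for `component` during the ISO week of `day`."""
    rotation = ROTATIONS[component]
    week = day.isocalendar()[1]  # ISO week number, 1..53
    # Assumed rule: the rotation advances by one person per ISO week.
    return rotation[week % len(rotation)]
```

An alert tagged with a component can then be routed to `on_call_for(component, date.today())` instead of waiting for someone in a channel to pick it up.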

nip, over 1 year ago

Custom-built monitoring on top of CloudWatch logs: we subscribe to the log groups and parse the logs.

Errors are reported in dedicated Slack channels.

The "MVP" was built in 1 week after we were faced with an outrageous bill from an observability vendor and decided to take a shot at implementing it ourselves.

In total I'd say we invested 2 additional man-weeks to get to where we are today.

It has worked extremely well for us and has needed little maintenance (granted, we pay AWS so that we don't have to do that maintenance).
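For readers who want to try something similar, here is a rough sketch of the parsing step. It assumes the subscription delivers to a Lambda function, in which case the event carries a base64-encoded, gzip-compressed JSON document under `awslogs.data`. The function name `extract_errors` and the plain substring match on "ERROR" are assumptions, not the commenter's implementation.

```python
import base64
import gzip
import json


def extract_errors(event: dict, needle: str = "ERROR") -> list:
    """Decode a CloudWatch Logs subscription event and return matching messages."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    group = payload["logGroup"]
    # Keep only log lines containing the needle, prefixed with their log group.
    return [
        f"[{group}] {entry['message']}"
        for entry in payload["logEvents"]
        if needle in entry["message"]
    ]
```

Each returned string could then be posted to the Slack channel dedicated to that log group.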

mtmail, over 1 year ago

StatusCake has a feature to call me. It's a horrible artificial voice saying "your website $name is down", but I'm fine with anybody shouting at me at 3am. The phone number is from the United States, and I don't need to add it to my phone book because it's the only US number calling me. (People inside the US might think it's another robocall.)

guybedo, over 1 year ago

I keep it simple with an uptime monitoring service that monitors all the elements of my stack and runs tests every minute:
- regular HTTP monitoring for websites
- run test queries on my SQL & Mongo databases
- check that RabbitMQ queues are not overflowing
- check that Docker containers are up

If something goes wrong, email & Telegram alerts.

fwiw I'm using https://uptimefunk.com
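The "run a set of checks every minute" idea can be sketched as a tiny runner. This is not how the service above works internally; `run_checks`, `queue_not_overflowing`, and the limit of 1000 messages are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Dict, List


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check once and return the names of the ones that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return failed


def queue_not_overflowing(depth: int, limit: int = 1000) -> bool:
    """Hypothetical queue check: healthy while depth stays at or under the limit."""
    return depth <= limit
```

A scheduler (cron, or a loop with a 60-second sleep) would call `run_checks` and send an email or Telegram message whenever the returned list is non-empty.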

rozenmd, over 1 year ago

Uptime monitoring + cron job monitoring via OnlineOrNot (dogfooding my own product), with alerts going to PagerDuty (set up to email -> SMS -> call me if I don't acknowledge), and a "public" alert in a Slack channel.
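The email -> SMS -> call chain is an escalation policy: each step fires only if the alert is still unacknowledged after some delay. A minimal sketch follows; the delays of 0, 5 and 10 minutes are assumptions, since the comment does not state them, and this is not PagerDuty's configuration format.

```python
# Hypothetical escalation steps: (minutes after the alert fires, channel).
ESCALATION = [(0, "email"), (5, "sms"), (10, "call")]


def channels_due(minutes_since_alert: float, acknowledged: bool) -> list:
    """Return every channel that should have fired by now for an open alert."""
    if acknowledged:
        return []  # acknowledging the alert stops the escalation
    return [channel for delay, channel in ESCALATION if minutes_since_alert >= delay]
```

For example, `channels_due(7, acknowledged=False)` covers the email and SMS steps but not yet the phone call.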

girishso, over 1 year ago

Nothing fancy. Alerts are posted in a Slack channel.

Cicero22, over 1 year ago

We have someone check Grafana a few times a day and alert us if there's an issue. Not great, but it works.

0xebo, over 1 year ago

Webhooks to Slack.
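For anyone who has not set this up before, a Slack incoming webhook accepts a JSON body with a `text` field sent as an HTTP POST. Below is a minimal standard-library sketch; the function names are made up for this example, and the webhook URL is whatever Slack issues for your workspace.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url: str, text: str) -> int:
    """Send the alert and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Splitting the request construction from the network call keeps the payload easy to test without touching Slack.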

Ask HN: What is the error alerting stack at your startup?
