科技回声

I'm a member of a 15-person team that is responsible for several production services. We try to write a runbook for every incident that occurs and tie it to our alerts (Datadog), but I've noticed a couple of problems with how we manage them:

1. Runbooks go out of date. This may be because no one on the team feels like they 'own' our runbooks (even though I feel that keeping them current should be everyone's responsibility).

2. We keep our runbooks in a GitHub repository for our team, but they are nested under several subfolders depending on the service they reference. This makes them hard to find during an incident (especially if you are woken up at 3am). GitHub search doesn't seem well suited to this use case.

I'm wondering how other teams manage their runbooks and keep them up to date and easily discoverable. I know there are tools out there like Rundeck (PagerDuty) and what was VictorOps (now Splunk On-Call or something), but these seem to focus on runbook automation, which is not what we want. I don't want to fully automate our runbooks, just make them easy to find when we need them and somehow encourage keeping them up to date.

Any ideas/feedback would be greatly appreciated!

Ask HN: How does your team organize/manage their runbooks?

No comments yet
