‘Prevents an issue like this: "I recently ran into a situation where I spent 6 hours understanding how something works that would have taken 20 minutes if the relevant information was stored somewhere."’<p>Recently I spent 10 hours on a usually routine task. At the end of the slog, my first thought wasn’t “I should spend more time writing that up!” The article was a good reminder that it’s worth scribbling something down. Even the bare basics, such as the “gotcha!” and a few snippets of code for debugging, could save someone else (future me?) another 10 hours.<p>One thing I didn’t get from the article was how runbooks are created. It mentions the “sticky note on someone’s desk” approach and the “workflows for everything” approach. There’s a lot of ground in between. I guess people write lots of how-tos and eventually they’re turned into a runbook?
The GitLab runbooks docs are a great place to learn: <a href="https://docs.gitlab.com/ee/user/project/clusters/runbooks/" rel="nofollow">https://docs.gitlab.com/ee/user/project/clusters/runbooks/</a>
And here are some actual runbooks which Societe Generale have donated to the community: <a href="https://github.com/certsocietegenerale/IRM/tree/master/EN" rel="nofollow">https://github.com/certsocietegenerale/IRM/tree/master/EN</a>
I keep personal "runbooks" for a lot of the common work I deal with over time. Eventually, this stuff gets automated where possible, but taking the time to work through all of the problems, write it down, and do it in a way that I can show someone else has helped me make sure that when I sit down to automate something, I truly understand the "domain".<p>It also helps tremendously when someone reaches out to you during off-hours: you can just look through the documentation you have on hand and blaze through the task in far less time than if you had to figure things out from scratch.
We use MS Word, one runbook per app for third-party applications. They have to be reviewed/updated/APPROVED at least once per year, not optional. It's this that adds the value, not how the info is stored.
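If it helps anyone, the "at least once per year" rule is easy to check mechanically, wherever the runbooks live. A minimal sketch, assuming you can get a last-approved date per runbook from somewhere; the `overdue_runbooks` helper and the app names are made up for illustration, not anything from a real tool:

```python
from datetime import date, timedelta

# Assumption: "at least once per year" is read as 365 days.
REVIEW_INTERVAL = timedelta(days=365)


def overdue_runbooks(last_reviewed, today, interval=REVIEW_INTERVAL):
    """Return the sorted names of runbooks whose last review is older than `interval`.

    `last_reviewed` maps a runbook name to the date it was last approved.
    """
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if today - reviewed > interval
    )


if __name__ == "__main__":
    # Hypothetical review log; in practice this would come from wherever
    # approvals are recorded.
    reviews = {
        "billing-app": date(2023, 1, 10),
        "crm": date(2024, 3, 1),
    }
    for name in overdue_runbooks(reviews, today=date(2024, 6, 1)):
        print(f"{name}: review overdue")
```

Passing `today` in explicitly keeps the check testable and lets you ask "what will be overdue next month?" ahead of time.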
Make sure whatever you keep them in has an offline option. I was on the end of handling an outage and Confluence blew up.<p>We're using Markdown in GitHub now, with a clone per SRE.
I'm the founder of a startup in this area. Our product is NOT just the usual automated runbook approach.<p>If anyone has explored more sophisticated solutions than wiki pages, I would love to talk and learn from your experience.
"We have a major incident with connectivity to the building, login to the knowledge server and see what the runbook says...oh!"<p>For mission critical stuff, print out your Incident Management processes and have a physical 'Master Runbook' in a prominent place in your department/cubicle/office.<p>Also, printed procedures with numbered steps and checkboxes allows for a visual record of progress, plus there's room for notes when things deviate from the expected.<p>Each annotated runbook then becomes a reference for the Incident/Problem management/RCA write-up - unless, of course, you fancy following the runbook on one screen (if you can), while also updating the service management ticket on another (if you can) and getting out comms to senior stakeholders (if you can) and dealing with the sudden influx of tickets, phone calls and emails (if all the systems are still accessible).<p>Plus, if you are called into a meeting, or have to go check something, you can take the paperwork with you.