Here are the things you can set up in a day or two. They will give you room to breathe so you can start to think more broadly and, most importantly, <i>they require little to no change in code</i>:<p>Take a look at Sentry[0]. It catches exceptions, groups them by type, counts them, displays them with their stack traces, and more. It integrates with Slack, GitLab/GitHub, etc., which makes creating issues and getting alerted easier. You won't have to dig through dozens of log files, or miss exceptions that happened a few hours ago and are completely drowned out by log messages.<p>Take a look at Prometheus[1] and Grafana[2]: you can set them up and have the state of your infrastructure in a dashboard (services up or down, how many times they were down and for how long, latency, CPU, GPU, disk space, RAM, etc.). Whatever matters to you, put it there. You can create custom alerts (example: storage 90% full) to give yourself a heads-up and act beforehand.<p>Look at whatever you do as a team when something goes wrong, figure out all the information you usually need to troubleshoot, and display it all in one place so you can tell at a glance where the problem is. This is reactive, but it's a start and you won't waste time hunting for it.<p>For development, you can have issue templates for incidents [service shut down and nobody saw it, etc.] and for bugs, to lower the barrier to writing good-quality incident reports. This way, people know what to write, where to write it, and how to write it. Put a tag in the template, the people in CC, whatever makes your life easier. Typical sections: summary, impact, recovery, investigation, future prevention.<p>One benefit is that once you have these incident reports, patterns emerge fast: the most frequent and the most impactful problems surface pretty quickly.<p>Once this is done, it will save you time and effort that you can put into reading more on the subject. Search for "Site Reliability Engineering", or "SRE"[3]. 
There are a few books, some more abstract and others more practical[4][5].<p>Take a look at Enterprise Ready[6]. It covers the most common requirements and features of an enterprise product (SSO, RBAC, etc.).<p>- [0]: <a href="https://sentry.io" rel="nofollow">https://sentry.io</a><p>- [1]: <a href="https://prometheus.io/" rel="nofollow">https://prometheus.io/</a><p>- [2]: <a href="https://grafana.com/" rel="nofollow">https://grafana.com/</a><p>- [3]: <a href="https://en.wikipedia.org/wiki/Site_reliability_engineering" rel="nofollow">https://en.wikipedia.org/wiki/Site_reliability_engineering</a><p>- [4]: <a href="https://sre.google/books/" rel="nofollow">https://sre.google/books/</a><p>- [5]: "Seeking SRE: Conversations About Running Production Systems at Scale"<p>- [6]: <a href="https://www.enterpriseready.io/" rel="nofollow">https://www.enterpriseready.io/</a>
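<p>To make the "storage 90% full" alert example concrete, here is a minimal Python sketch of the check such an alert encodes. In practice Prometheus would evaluate a rule like this for you, so treat it as an illustration only. The function names <code>disk_usage_ratio</code> and <code>should_alert</code>, the default path, and the threshold default are all hypothetical and not part of any tool mentioned above.

```python
import shutil


def disk_usage_ratio(path="/"):
    """Return the fraction of the filesystem at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total


def should_alert(used_ratio, threshold=0.90):
    """Return True when usage has reached the threshold (assumed 90% here)."""
    return used_ratio >= threshold


if __name__ == "__main__":
    ratio = disk_usage_ratio("/")
    if should_alert(ratio):
        print(f"ALERT: storage is {ratio:.0%} full")
    else:
        print(f"OK: storage is {ratio:.0%} full")
```

The threshold sits below 100% on purpose: the point of the alert is to give you a heads-up while there is still time to act.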