TechEcho

3 comments

avl999 over 4 years ago

This is an incredibly broad question and touches on many aspects of software engineering, devops and operations. For operations, I would recommend having the team read the Google SRE book: https://sre.google/workbook/table-of-contents/ It has everything one would need to set up a modern operations infrastructure and the associated best practices.

Sounds like you are unhappy with the quality of the code and error handling. Seems like it was written by junior devs? I would recommend having the team read the usually recommended books if they haven't. At the bare minimum:

* Refactoring by Fowler
* Clean Code by Robert Martin

If you are using an OO language:

* Practical Object-Oriented Design in Ruby

The book has "in Ruby" in the title, but it's a general-purpose book on what makes OO design "good".

Then follow it up with:

* Patterns of Enterprise Application Architecture
* Clean Architecture

These books are not perfect or the be-all and end-all, and parts of them might be slightly dated, but they get you a large chunk of the way to the promised land.

Lastly, if the application is working, the users are happy, there is no bug infestation and you are not having issues with releasing new features, don't feel the pressure to immediately "fix" the code.

tarun_anand over 4 years ago

The best place to start is to read open source code. Start with a small but popular repository on GitHub.


Jugurtha over 4 years ago

Here are the things you can set up in a day or two. These are things you can do right now that will give you room to breathe so you can start to think more broadly, but most importantly, they don't require changes in code:

Take a look at Sentry[0]. It will catch exceptions, group them by type, count them, display them with the stack trace, and more. It has integrations with Slack, GitLab/GitHub, etc. It makes creating issues and alerting you easier. You won't have to dig into dozens of log files and miss exceptions that happened a few hours ago but are completely drowned by log messages.

Take a look at Prometheus[1] and Grafana[2]: you can set them up and have the state of your infrastructure in a dashboard (services up or down, how many times they were down and for how long, latency, CPU, GPU, disk space, RAM, etc.). Whatever matters to you, put it there. You can create custom alerts (example: storage 90% full) to give yourself a heads up and act beforehand.

Look at whatever you do as a team when there's something wrong, figure out all the information you usually need to troubleshoot, and have it all displayed so you can glance at something and know quickly where something is wrong. This is reactive, but it's a start and you won't waste

For development, you can have issue templates for incidents (a service shut down and nobody saw it, etc.) and bugs, to lower the barrier to entry to writing good quality incident reports. This way, people know what to write, where to write it, and how to write it. Put a tag in the template, the people in CC, whatever makes your life easier. Summary, impact, recovery, investigation, future prevention.

One benefit of that is that when you have these incident reports, patterns will emerge fast. It surfaces the most frequent and the most impactful issues pretty quickly.

Once this is done, it will save you time and effort that you can put into reading more on the subject. Search for "Site Reliability Engineering", or "SRE"[3]. There are a few books, some more abstract and others more practical[4][5].

Take a look at Enterprise Ready[6]. It talks about the most common requirements and features in an enterprise product (SSO, RBAC, etc).

- [0]: https://sentry.io
- [1]: https://prometheus.io/
- [2]: https://grafana.com/
- [3]: https://en.wikipedia.org/wiki/Site_reliability_engineering
- [4]: https://sre.google/books/
- [5]: "Seeking SRE: Conversations About Running Production Systems at Scale"
- [6]: https://www.enterpriseready.io/
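
To make the Sentry point above a bit more concrete, here is a minimal standard-library sketch of what "catch exceptions, group them by type, count them, and keep the stack trace" looks like in principle. This is not Sentry's API or how Sentry is implemented; `ExceptionTally` and `run_and_record` are hypothetical names invented for this illustration, and a real tool adds persistence, deduplication, alerting and integrations on top.

```python
import traceback
from collections import defaultdict


class ExceptionTally:
    """Toy in-memory store (hypothetical): groups exceptions by type name."""

    def __init__(self):
        # exception type name -> list of formatted stack traces
        self._traces = defaultdict(list)

    def record(self, exc):
        """Store the formatted stack trace of an exception under its type name."""
        trace = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        self._traces[type(exc).__name__].append(trace)

    def counts(self):
        """Return (type name, occurrences) pairs, most frequent first."""
        return sorted(
            ((name, len(traces)) for name, traces in self._traces.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )

    def latest_trace(self, name):
        """Return the most recent stack trace recorded for a known type name."""
        return self._traces[name][-1]


def run_and_record(tally, func, *args):
    """Call func(*args); record any exception instead of losing it in the logs."""
    try:
        return func(*args)
    except Exception as exc:  # broad on purpose: this is a catch-all reporter
        tally.record(exc)
        return None


tally = ExceptionTally()
run_and_record(tally, int, "not a number")    # raises ValueError
run_and_record(tally, int, "still not one")   # raises ValueError
run_and_record(tally, lambda: {}["missing"])  # raises KeyError

for name, count in tally.counts():
    print(f"{name}: {count}")
```

Even this toy version shows why grouping helps: two failures of the same kind collapse into one entry with a count, so the most frequent problem is visible at a glance instead of being scattered across log files.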

Ask HN: Refactoring Code to Enterprise Level
