We've been looking at making CloudWatch (CW) alarms an automated part of our infra. Here are some findings that may help:<p>- The semantics of CW seem convoluted at first, but once you stare at the API docs for long enough, the core concepts are easy to grok: Metrics (data points regularly submitted from a machine to CW), Alarms (definitions of the alerting logic, based on the behavior of Metrics), and SNS Topics (what to notify when an Alarm goes off; this can be as simple as an email subscription).<p>- Once you get the data model right, all implementations (click ops, terraform, bash via awscli, boto3, etc.) are virtually identical.<p>- Some Metrics come for free, e.g. CPU usage is reported to CW by any EC2 instance. For some other Metrics, notably disk and memory usage, you need to configure your instance to report them to CW. This is where the OP's monitoring scripts come in.<p>- The monitoring scripts and the cron config the OP refers to are deprecated [0]. Instead there's a new CloudWatch Agent [1]: you install the package on your EC2 instances, provide a configuration file to it, and you're set.<p>[0] <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scripts.html" rel="nofollow">https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scri...</a><p>[1] <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html" rel="nofollow">https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitori...</a>
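To make the Metric / Alarm / SNS Topic data model concrete, here is a rough sketch in Python. It only builds plain dictionaries, so it runs without AWS credentials. The instance ID, topic ARN, alarm name, and thresholds are made-up placeholders, and the agent config is a minimal fragment that assumes you only want memory and root-disk usage; check the agent docs in [1] for the full schema.

```python
import json


def build_cpu_alarm(instance_id, topic_arn, threshold=90.0):
    """Build keyword arguments in the shape CloudWatch's PutMetricAlarm expects.

    The alarm name, period, and threshold are illustrative, not recommendations.
    """
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        # The Metric: which time series the Alarm watches.
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        # The Alarm logic: average over 5-minute periods, 3 periods in a row.
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # The SNS Topic: where the notification goes when the Alarm fires.
        "AlarmActions": [topic_arn],
    }


# Minimal CloudWatch Agent config fragment for the metrics EC2 does not
# report on its own (memory and disk). Assumes a Linux host with "/" mounted.
AGENT_CONFIG = {
    "metrics": {
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
        }
    }
}

if __name__ == "__main__":
    alarm = build_cpu_alarm(
        "i-0123456789abcdef0",  # placeholder instance ID
        "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder ARN
    )
    print(json.dumps(alarm, indent=2))
    print(json.dumps(AGENT_CONFIG, indent=2))
```

With boto3 installed and credentials configured, the dictionary can be passed straight through as `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. The same fields map one-to-one onto the terraform resource and the awscli flags, which is why the implementations end up looking alike.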
High CPU alerts are terrible alerts. If I'm paying per instance, I <i>want</i> CPU utilization to be high. If it's low, I'm wasting money. So now what I need is an alert that fires not when CPU is high, but somewhere between "high" and "too high". You know, like when there's an arbitrary spike because Java is doing some GC. Or when you have a one-minute spike of traffic that fires an Ops Genie alert at 2am but auto-clears between when the on-call engineer wakes up and when they log in to check.<p>For the love of $DEITY, if you're going to set up CloudWatch monitoring, create custom metrics that map to your business outcomes and alert when <i>those</i> go off the rails.
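As a sketch of what a business-outcome metric could look like: the snippet below builds the payloads for CloudWatch's PutMetricData and PutMetricAlarm calls. The namespace `MyShop/Business`, the metric `CheckoutsCompleted`, and the thresholds are all invented for illustration; substitute whatever actually tracks your business. It only constructs dictionaries, so it runs without AWS access.

```python
from datetime import datetime, timezone

NAMESPACE = "MyShop/Business"       # hypothetical custom namespace
METRIC_NAME = "CheckoutsCompleted"  # hypothetical business metric


def checkout_datapoint(completed, when=None):
    """Build PutMetricData arguments reporting completed checkouts."""
    return {
        "Namespace": NAMESPACE,
        "MetricData": [
            {
                "MetricName": METRIC_NAME,
                "Timestamp": when or datetime.now(timezone.utc),
                "Value": float(completed),
                "Unit": "Count",
            }
        ],
    }


def low_checkout_alarm(topic_arn, minimum_per_period):
    """Build PutMetricAlarm arguments that fire when checkouts stay too low.

    Requiring 3 of 3 five-minute periods to breach keeps a single short
    dip from paging anyone.
    """
    return {
        "AlarmName": "checkouts-too-low",  # hypothetical name
        "Namespace": NAMESPACE,
        "MetricName": METRIC_NAME,
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": float(minimum_per_period),
        "ComparisonOperator": "LessThanThreshold",
        # No data at all usually means the reporter or the site is down.
        "TreatMissingData": "breaching",
        "AlarmActions": [topic_arn],
    }


if __name__ == "__main__":
    print(checkout_datapoint(42))
    print(low_checkout_alarm("arn:aws:sns:us-east-1:123456789012:ops-alerts", 10))
```

With boto3 these would be sent as `client.put_metric_data(**checkout_datapoint(n))` and `client.put_metric_alarm(**low_checkout_alarm(arn, 10))`, where `client = boto3.client("cloudwatch")`.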