Monitoring and alerting have matured a lot in the last decade. We have tools to manage heaps of data, we have the ELK stack, we have NewRelic and tons of other SaaS options (I prefer niche tools like AppSignal) which even come with AI-based anomaly detection. With all this data and tooling available, it gets harder and harder to see the forest for the trees. But there’s a simple rule to decide whether your next alert adds any value.
If you’re deciding which alerts you want to establish, ask yourself:
Does this help me sleep at night?
If it does, great. Set that alert up. Make sure it interrupts your sleep when things are not how they should be. If it doesn’t, go on to ask yourself whether it would help somebody else sleep better.
As a team you might not always agree about your alerts and thresholds. Everyone has a different appetite for risk. It’s important that you take the time to make all the implicit expectations explicit. Try to find common ground that ensures important alerts are visible and unimportant ones are discarded.
You need actionable alerts
Make sure your alerts are actionable. If you get woken up at 4am but there’s nothing you can do about it, your alerting sucks.
What often happens is that an alert is actionable, just not by you. In this scenario you are essentially a router, passing the page on to whoever can fix the problem. This means your alerts are not specific enough. Or worse, the responsible people managed to swindle themselves out of pager duty.
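One way to make this concrete is to review every paging alert against two questions: is there something to do, and is the person being paged the one who can do it? The following is only a minimal sketch of such a review. The `Alert` fields, the team names, and the example alerts are all made up for illustration and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All fields are hypothetical; adapt them to whatever your tooling records.
    name: str
    owner: str    # team that can actually fix the problem
    action: str   # what the responder should do; empty means "nothing"
    pages: bool   # True if this alert wakes someone up


def review(alert: Alert, on_call_team: str) -> str:
    """Classify an alert from the point of view of the team being paged."""
    if not alert.pages:
        return "ok: does not page"
    if not alert.action.strip():
        return "drop or demote: nothing to do at 4am"
    if alert.owner != on_call_team:
        return f"reroute to {alert.owner}: you are only a router"
    return "ok: actionable by you"


# Made-up examples, reviewed as if the "platform" team were on call.
alerts = [
    Alert("disk-90-percent", "platform", "expand volume or rotate logs", True),
    Alert("upstream-provider-down", "platform", "", True),
    Alert("payment-queue-stuck", "payments", "restart the consumer", True),
]
verdicts = {a.name: review(a, "platform") for a in alerts}

for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

The point of the sketch is the order of the checks: an alert with no possible action should never page anyone, and an alert with an action should page the team that owns it.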
Common causes of bad alerting
When we don’t know the system well enough, we tend to err on the safe side and over-alert. We’re afraid of missing a crucial bit, so the alerting ends up driven by fear.
Another driving force might be an engineering organisation which makes alerting a best practice without fully embracing the nuances of meaningful alerts. If every host and every application is rolled out with pre-defined, one-size-fits-all alerts, and every team is blindly pushed to add a number of domain-specific alerts, you’ll create a lot of noise. In other words, this is putting the cart before the horse.
Besides unnecessarily stealing someone’s sleep, noisy alerting systems lead to another problem: alert fatigue. The few important alerts are drowned out by the noise of useless ones. As a result, nobody looks at them anymore because some alert is constantly buzzing.
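If you keep a record of which alerts fired and whether anyone actually did something in response, you can put a rough number on that noise. The sketch below assumes such a history exists as simple `(alert_name, was_acted_on)` pairs. The data and the 0.8 threshold are invented for illustration, not recommended values.

```python
from collections import Counter


def noise_report(history):
    """Return, per alert name, the share of firings nobody acted on.

    `history` is an iterable of (alert_name, was_acted_on) pairs.
    """
    fired = Counter()
    acted = Counter()
    for name, was_acted_on in history:
        fired[name] += 1
        if was_acted_on:
            acted[name] += 1
    return {name: 1 - acted[name] / fired[name] for name in fired}


def noisy_alerts(history, threshold=0.8):
    """List alerts whose ignored share is at or above the (arbitrary) threshold."""
    report = noise_report(history)
    return sorted(name for name, ratio in report.items() if ratio >= threshold)


# Made-up history: cpu-high fired five times and was always ignored.
history = (
    [("cpu-high", False)] * 5
    + [("db-down", True)] * 2
    + [("queue-lag", True)]
    + [("queue-lag", False)] * 3
)

print(noise_report(history))
print(noisy_alerts(history))
```

Alerts that show up in this list are candidates for deletion, a higher threshold, or a demotion from page to dashboard.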
Better alerts equal better sleep
The aim is not just to sleep more but to worry less. If you have fewer alerts, but they are more meaningful and more specific, you will sleep better and waste less time.