When a device goes down, we get a notification from every sensor that has one configured, each saying that the sensor can't be read. For example, we'd get notifications for CPU, Memory, Uptime, Volume C, Volume F, Volume G, Volume H, etc.
When each device has more than 10 sensors, we get a lot of notifications, and that can hide the actual issue.
This became an issue when three servers went down the other day due to a problem at our hosting provider and we got 40 or so notifications. Because we were bombarded with them, we couldn't easily tell that the problem was limited to just three servers.
It would be nice if we could say that device X is down when we fail to connect to Y of its sensors, then send out a single "device X is down" notification and suppress the others until it comes back up.
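To make the request concrete, here is a rough sketch of the rule I have in mind, written as plain Python rather than against any real monitoring API. Every name in it (`DeviceState`, `on_sensor_event`, `threshold`) is made up for illustration, and it assumes "comes back up" means that all of the device's sensors are readable again, which may not be the right definition.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceState:
    """Hypothetical per-device bookkeeping (not part of any real product API)."""
    threshold: int                            # Y: failed sensors needed to call the device down
    failed: set = field(default_factory=set)  # sensors that currently cannot be read
    down: bool = False                        # True once "device is down" has been sent


def on_sensor_event(state, device, sensor, ok):
    """Return the list of notifications to send for one sensor result."""
    newly_failed = (not ok) and sensor not in state.failed
    if ok:
        state.failed.discard(sensor)
    else:
        state.failed.add(sensor)

    if state.down:
        # Device already reported as down: suppress per-sensor messages.
        # Assumption: the device counts as "up" once every sensor is readable again.
        if not state.failed:
            state.down = False
            return [f"device {device} is up"]
        return []

    if len(state.failed) >= state.threshold:
        # Threshold reached: send one device-level message instead of per-sensor ones.
        state.down = True
        return [f"device {device} is down"]

    if newly_failed:
        # Below the threshold, behave as today: one message per failing sensor.
        return [f"{device}: cannot read sensor {sensor}"]
    return []
```

With a threshold of 3, the first two failing sensors would still notify individually, the third would produce the single "device X is down" message, and everything after that would stay quiet until the device recovers. A lower threshold would cut the noise further at the cost of calling a device down more eagerly.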