I have a custom sensor returning a couple of channels.
To give a clearer picture, each channel has the following structure:
<result>
<channel>server0</channel>
<value>100</value>
<Unit>Percent</Unit>
<LimitMinWarning>80</LimitMinWarning>
<LimitMinError>60</LimitMinError>
<LimitWarningMsg>Process is in maintenance or starting.</LimitWarningMsg>
<LimitErrorMsg>Process is stopped or in an unknown state</LimitErrorMsg>
<LimitMode>1</LimitMode>
</result>
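For reference, here is a minimal sketch of how a script could emit that structure for several channels. It is an illustration, not the actual sensor script: the function names `build_result` and `build_output` are hypothetical, the `<prtg>` root element assumes a PRTG-style EXE/Script Advanced sensor, and the `bootstrap` channel name is taken from the error message quoted below.

```python
import xml.etree.ElementTree as ET


def build_result(channel: str, value: int) -> ET.Element:
    """Build one <result> element with the static limits shown above."""
    result = ET.Element("result")
    fields = {
        "channel": channel,
        "value": str(value),
        "Unit": "Percent",
        "LimitMinWarning": "80",
        "LimitMinError": "60",
        "LimitWarningMsg": "Process is in maintenance or starting.",
        "LimitErrorMsg": "Process is stopped or in an unknown state",
        "LimitMode": "1",
    }
    for tag, text in fields.items():
        ET.SubElement(result, tag).text = text
    return result


def build_output(values: dict[str, int]) -> str:
    """Wrap one <result> per channel in an assumed <prtg> root element."""
    root = ET.Element("prtg")
    for channel, value in values.items():
        root.append(build_result(channel, value))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Channels that currently have no data are simply left out of the dict.
    print(build_output({"server0": 100, "bootstrap": 0}))
```

Every channel gets the same thresholds and messages; only the name and value differ between runs.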
At some point, one of the channels returned a value of 0, which put the whole sensor into a red status (as expected). The message was the following: "(bootstrap) is below the error limit of 60% in bootstrap. Process is stopped or in an unknown state."
The issue starts when routine changes in the monitored environment cause some channels to stop receiving data from a certain point onward. This is expected behavior, and having no data for a couple of channels is fine. However, when a different channel later returned a value of 0, the error message stayed the same: it still named "bootstrap" instead of the channel that was actually down.
All the channels have a static structure, and the thresholds are never changed. Restarting the sensor does not help. Recreating the sensor does fix the message, but it is not a workable solution: given the size of our client's monitored environment, it would mean a lot of unnecessary manual work. On top of that, losing the historical data is a deal-breaker from their point of view.
Any clarification would be a great help.