Hi Paessler support team.
We've experienced some similar problems, and I think it could be a bug in the remote probe part of your software. We'd like your input and potential help in diagnosing the problem. We're running PRTG 15.1.13.1382+.
We have about 800 sensors in our installation. I've seen the scenario I'm about to describe unfold twice now with exactly the same symptoms.
Part of our setup is a lot of WMI queries against a series of our Windows servers. Let's call these sensors WA1, WA2, WB1, WB2 etc., denoting W (WMI sensor), A (machine A), 1 (sensor number 1). The exact sensor type is "WMI Vital System Data (V2) sensor".
For some other devices we have some performance counter sensors. We'll call these PX1, PX2, PY1, PY2 etc., denoting P (PerfCounter sensor), X (machine X), 1 (sensor number 1). The exact sensor type is "PerfCounter Custom sensor".
What we've experienced twice now is that some runaway process on a machine against which we run WMI sensors ate all the memory on the machine and then some. This made the machine very "sluggish", and it pretty much failed to respond to anything in a timely manner. The WMI sensors then started timing out (for good reason), which is exactly what we wanted to see because it revealed the problem. So in our error scenario, for the sake of argument, let's say sensors WB1 & WB2 go into a state of timeout. I believe the exact error was "Message was cancelled by the message filter"; I was able to get the same error doing WMI queries against machine B from PowerShell.
Shortly after this happened we also started to see timeouts on all our performance counter sensors against completely different machines. The error for these sensors was "The wait operation timed out (Performance Counter error 0x102)".
That is, all of the "PerfCounter Custom sensor" instances (PX1, PX2, PY1, PY2 etc.) started displaying this timeout.
I went to machines X, Y, Z etc. and verified that there were no problems. The performance counters were fine when viewed locally. I also went to the server running our remote probe and tried to query the performance counters on machines X, Y, Z etc. from there, using something like PowerShell's "Get-Counter" or the Performance Monitor connected remotely to those machines. Again, no problems were observed: the counters were replying just fine and giving correct values. However, PRTG continued displaying timeouts.
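For anyone who wants to repeat this kind of check from the probe server in a scripted way, here is a minimal sketch. It is an assumption, not the exact command used above: it shells out to the Windows "typeperf" tool instead of Get-Counter, and the machine name, counter path and the 30-second timeout are placeholders.

```python
import subprocess
from typing import List


def build_typeperf_command(machine: str, counter: str, samples: int = 1) -> List[str]:
    """Build a typeperf command line that reads one counter from a remote machine."""
    # Remote counter paths have the form \\MACHINE\Object(Instance)\Counter.
    counter_path = "\\\\" + machine + "\\" + counter.lstrip("\\")
    # -sc limits the number of samples so the command exits by itself.
    return ["typeperf", counter_path, "-sc", str(samples)]


def query_counter(machine: str, counter: str, timeout_s: float = 30.0) -> str:
    """Run typeperf and return its raw CSV output.

    Raises subprocess.TimeoutExpired if the query hangs and
    subprocess.CalledProcessError if typeperf exits with an error.
    """
    result = subprocess.run(
        build_typeperf_command(machine, counter),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout
```

Calling query_counter("MACHINE-X", r"Processor(_Total)\% Processor Time") on the probe server (Windows only) should return one sample if the counter is reachable from there, independently of PRTG.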
In the meantime we rebooted server B, the machine whose WMI sensors were timing out, and the WMI sensors WB1 & WB2 started responding again and giving normal values. However, our PerfCounter sensors PX1, PX2, PY1, PY2 etc. were all still in a state of timeout.
I've tried pausing the PerfCounter sensors for up to 30 minutes and resuming them, to no avail: they time out again as soon as they are resumed.
This scenario has played out twice now, and in both cases the only thing I could do to resolve the issue was to restart the PRTG remote probe service completely. That seems to "clear the pipes", so to speak. Once the probe comes back online and manages to catch up with the backlog of queries, all the PerfCounter sensors work fine again.
So the very short version of all this is: timeouts on WMI sensors in one part of the architecture seem to cause all of our PerfCounter sensors against other parts of the architecture to start timing out as well, with the error "The wait operation timed out (Performance Counter error 0x102)", for no apparent reason. The only way I've found to fix the problem is to restart the remote probe, which is not very desirable because it leaves us "in the dark", so to speak, with regard to monitoring while it catches back up.
Is this scenario something you can replicate in your internal test environments or do you have some potential ideas on what could be causing it?
Best Regards
Peter Dahlgaard