When running PRTG on a virtual machine, for example, VMware, I encounter some performance and stability issues. Are there recommendations for settings when running PRTG on a VM?
This article applies as of PRTG 19.
Important Notice: We do not update this article anymore. For up-to-date rules and guidelines for running large installations of PRTG on VMware, see our Best Practice Guide: Running large installations of PRTG in a virtual environment.
Checklist for Running PRTG on VMware
If you run PRTG on a virtual machine, keep in mind the following recommendations for acceptable performance:
- For a cluster setup, divide these values by the number of cluster nodes when running the PRTG core server on a virtual machine.
- Use really fast storage: Physical storage directly on the PRTG core server is much faster than storage over the network. However, network storages can be configured in failover mode: If an ESX server fails, the storage will be moved immediately to another one, so it has little impact on PRTG. Measure the advantages against the disadvantages in your setup.
- Use long sensor scanning intervals of 15 minutes or more.
- Make sure that clocks are synchronized: On your VM’s guest operating system, VMware Tools should be installed. Ensure that the system time of VMware is synchronized with the time of the host system or an NTP server, but not both. Check that the host syncs time correctly as well. If clocks are unsynchronized, PRTG probes might be unable to connect, and HTTP requests to the PRTG webserver can fail as well, mainly when using HTTPS. In addition, HTTP sensors can fail (especially when monitoring websites via HTTPS) if the time difference between the system running the probe and the target system is too big. Running the Network Time Protocol (NTP) client on the ESX host and the domain controller can keep clocks synchronized over a network.
- Use VMware 5 to reduce resource issues. Avoid earlier VMware versions.
- Consider that PRTG creates a lot of input/output (I/O) on your system. Other VMs might interfere with this traffic. To reduce related bottlenecks, use VMware 5. Avoid earlier VMware versions.
- Configure your VM to have the CPU cores on the same CPU socket. CPU cores on individual sockets may result in serious performance issues. Scheduling threads over different sockets in such a configuration has a high impact on the operating system so that Windows may not do it at all. This means that PRTG can only use 1 CPU with 1 core in this configuration and cannot work properly.
- Use a VMXNET (original, 2 or 3) network adapter instead of an emulated one (Vlance, e1000x). You must install or update the VMware Tools on the guest OS to have a driver for the VMXNET network adapters available.
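The same-socket recommendation in the checklist above can be checked against a VM's configuration file. The following is a minimal Python sketch, not an official tool. It assumes that the .vmx file stores the vCPU count as `numvcpus` and the cores per virtual socket as `cpuid.coresPerSocket`, and that a missing `cpuid.coresPerSocket` entry means one core per socket. Verify both assumptions against your VMware version. The function names are made up for illustration.

```python
def parse_vmx(text):
    """Parse 'key = "value"' lines of a .vmx file into a dict (keys lowercased)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def socket_count(settings):
    """Return the number of virtual sockets the vCPUs are spread over."""
    vcpus = int(settings.get("numvcpus", "1"))
    # Assumption: a missing cpuid.coresPerSocket means one core per socket.
    cores_per_socket = int(settings.get("cpuid.corespersocket", "1"))
    return vcpus // cores_per_socket


sample = '''
numvcpus = "4"
cpuid.coresPerSocket = "4"
'''
sockets = socket_count(parse_vmx(sample))
print("virtual sockets:", sockets)
if sockets > 1:
    print("vCPUs are spread over several sockets; consider one socket.")
```

A result of 1 means that all cores sit on the same virtual socket, which is what the checklist asks for.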
Resource management for your virtual machines is more important the bigger your environment is. There are more performance issues with environments running several VMs.
To run PRTG on a VM, consider customizing resource allocation settings to obtain the best possible performance for the VM with PRTG. This issue especially concerns the resource types CPU, memory, and storage.
For detailed information about resource management for VMware, see the official documentation vSphere Resource Management.
You have the following options to increase the performance of the virtual machine PRTG is running on:
- Make sure that a fixed amount of the physical memory of your host server is assigned to the virtual machine running PRTG. We recommend 8 GB or more.
- For best performance of your PRTG server(s), ensure that your host system always assigns resources to the virtual machine running PRTG with higher priority than to other VMs, depending on your needs.
Also keep memory ballooning and over-provisioning in mind:
- Memory ballooning: This is a memory over-commitment mechanism for multiple virtual machines while they are running. Memory that was allocated to a virtual machine can be given to another virtual machine without manually changing settings. This can affect performance and stability of PRTG. To avoid this issue, customize your VMware settings as described above.
- Over-provisioning: This is the difference between the physical capacity of CPU and memory, and the logical capacity through the operating system as available for the user. The host can give the VMs more CPU or memory than physically available. Over-provisioning can lead to resource bottlenecks and can affect the performance and stability of PRTG. For this reason, customize your VMware settings as described above.
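To make over-provisioning and ballooning exposure concrete, here is a small Python sketch that computes how far a host is over-committed and how much of a VM's memory is not covered by a reservation. All VM names and sizes are made-up example values, and the ratios are a simple illustration, not a VMware metric.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    vcpus: int
    memory_gb: float
    reserved_gb: float = 0.0  # memory reservation guaranteed by the host


def overcommit_ratios(vms, host_cores, host_memory_gb):
    """Return (cpu_ratio, memory_ratio); values above 1.0 mean over-provisioning."""
    cpu = sum(vm.vcpus for vm in vms) / host_cores
    mem = sum(vm.memory_gb for vm in vms) / host_memory_gb
    return cpu, mem


def unreserved_gb(vm):
    """Memory without a reservation, which the host could reclaim from the VM."""
    return max(vm.memory_gb - vm.reserved_gb, 0.0)


# Hypothetical host with 16 physical cores and 64 GB of memory.
vms = [
    VM("prtg-core", vcpus=4, memory_gb=8.0, reserved_gb=8.0),
    VM("app-1", vcpus=8, memory_gb=32.0),
    VM("app-2", vcpus=8, memory_gb=32.0),
]
cpu_ratio, mem_ratio = overcommit_ratios(vms, host_cores=16, host_memory_gb=64.0)
print(f"CPU over-commit: {cpu_ratio:.2f}x, memory over-commit: {mem_ratio:.2f}x")
for vm in vms:
    print(f"{vm.name}: {unreserved_gb(vm):.1f} GB without reservation")
```

In this example, the PRTG VM has its full 8 GB reserved, so none of its memory is exposed to reclamation even though the host as a whole is over-committed.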
Potential Problems with VMXNET 3 Virtual NIC Cards and Hardware Offload Engine
We have seen cases where customers using VMXNET 3 virtual network cards encountered corrupted UDP packets (bad checksum) caused by the hardware offload engine of the VMware NIC. This resulted in UDP packet loss, unstable SNMP monitoring, and other effects. To work around this, disable TCP/UDP checksum offloading on the network adapter from within the guest operating system.
See the VMware knowledge base article about this particular NIC driver (browse to section VMXNET 3): Choosing a network adapter for your virtual machine
Please note that this post is based on our/my experiences and our environment. For a few details of our environment, see below.
Some general recommendations:
- Proper setup and tuning of your virtual environment is mandatory. Look at your vendor's (optimization) recommendations for virtual environments.
- Take time to think about separating sensors over several probes. This balances the load over several probes, so DRS can vMotion each probe to the most suitable host.
- Don't use auto-discovery, but really think about what you want/need to monitor, to avoid unnecessary sensors. Prefer SNMP sensors over WMI sensors. ( https://kb.paessler.com/knowledgebase/en/topic/4113-how-many-wmi-sensors-are-maximum-on-what-system )
- First, monitor the datacenter hardware layer, especially what may be potential bottlenecks: load on storage controllers, networking on physical machines, switches, load on physical machines, etc. This is useful while you build up the PRTG environment, because resources like storage performance may become an issue, and then you're aware of it.
- Second, start monitoring the virtual servers. Build it up -controlled- and keep track of the performance of the whole PRTG environment and the physical environment.
In our case:
- HP did a health check on our environment. We have an optimized virtual infrastructure.
- We've split up the hardware VLAN and the virtual machines VLAN over 2 separate remote probes.
- We also split up our core application and Citrix over 2 separate remote probes.
- Initially, we install the PRTG Remote Probe on a Win2008R2 with 1 GB memory. Based on system health and the warnings of the probe sensors, we increase resources.
- The probes with virtual Windows servers have 2 groups, the Server Layer and the Application Layer. The server layer is based on a device template, the application layer has additional sensors.
About our environment:
- running VMware vSphere 5.1
- using 2 HP enclosures with HP blades, interlinked
- HP VirtualConnect, fibre connected to SAN switches and fibre connected to LAN
- HP EVA SAN storage, fibre connected to SAN switches
- VMware resource pool: normal shares, no reservations, no limitations
- VMs: VMXNET 3 NIC drivers, resources of VMs: default (memory/CPU/disk)
- PRTG Network Monitor 126.96.36.1993 x64
- approx. 200 VMs (130x Win2008R2 / 70x 2003 R2)
- 13504 sensors
Primary and failover (virtual) cluster node: Win 2008 R2 64-bit, normal shares, 2 vCPU, 6 GB internal memory. 4 virtual probes within the datacenter: Win 2008 R2 64-bit, normal shares, 2 vCPU, 1 GB internal memory.
Our scanning interval is 60 seconds. We do not use dedicated storage. The 4 virtual probes have: 388, 23, 4289 and 5101 sensors.
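To get a feeling for the load these numbers imply, the following Python sketch estimates the average number of sensor scans each probe has to start per second. It assumes that every sensor on a probe uses the same interval and that scans are spread evenly over it, which is a simplification. The 15-minute value is the interval recommended in the checklist at the top of this article.

```python
def scans_per_second(sensor_count, interval_seconds):
    """Average number of sensor scans a probe must start per second."""
    return sensor_count / interval_seconds


probe_sensors = [388, 23, 4289, 5101]  # sensor counts of the 4 virtual probes

for count in probe_sensors:
    at_60s = scans_per_second(count, 60)
    at_15min = scans_per_second(count, 15 * 60)
    print(f"{count:>5} sensors: {at_60s:6.2f}/s at 60 s, {at_15min:5.2f}/s at 15 min")
```

The ratio between the two columns is always 15, so a 60-second interval produces 15 times the scan load of the recommended 15-minute interval.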
sensortypes: 36x esxserverhealthsensorextern, 36x esxserversensorextern, 3x exe, 21x exexml, 1x file, 5x folder, 5x ftp, 70x http, 15x httpadvanced, 25x ldap, 521x ping, 344x port, 29x probestate, 2x ptfadsreplfailure, 23x ptfhttpxmlrestvalue, 104x ptfloggedinusers, 4x ptfpingjitter, 3x ptfscheduledtaskxml, 10x remotedesktop, 1x smbdiskspace, 2x smtp, 1x sniffercustom, 2x snifferheader, 125x snmpciscosystemhealth, 182x snmpcpu, 207x snmpcustom, 6x snmpcustomstring, 479x snmpdiskfree, 2x snmphpphysicaldisk, 3x snmphpsystemhealth, 7593x snmplibrary, 366x snmpmemory, 16x snmprmon, 686x snmptraffic, 1x snmptrap, 185x snmpuptime, 2x sntp, 48x sshesxdiskfreev2, 1x syslog, 29x systemstate, 3x winapieventlog, 27x wmicustom, 43x wmidiskspace, 193x wmieventlog, 6x wmiexchangeserver, 3x wmiexchangetransportqueues, 12x wmilogicaldisk, 17x wmimemory, 47x wminetwork, 16x wmipagefile, 3x wmiphysicaldisk, 2x wmiprocess, 18x wmiprocessor, 285x wmiservice, 141x wmishare, 10x wmisqlserver2008, 77x wmiuptime, 201x wmiutctime, 1209x wmivitalsystemdata, 8x wmivolume
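A listing like this can be summarized per sensor family, for example to see how an installation follows the advice to prefer SNMP over WMI sensors. This is a small Python sketch. Grouping by the name prefixes `snmp` and `wmi` is an assumption made for illustration, and the sample string contains only a few entries copied from the list above.

```python
from collections import Counter


def tally_by_family(listing):
    """Sum sensor counts per family from entries such as '182x snmpcpu'."""
    totals = Counter()
    for entry in listing.split(","):
        count_text, _, sensor_type = entry.strip().partition("x ")
        if not sensor_type:
            continue  # skip entries that do not look like '<count>x <type>'
        if sensor_type.startswith("snmp"):
            family = "snmp"
        elif sensor_type.startswith("wmi"):
            family = "wmi"
        else:
            family = "other"
        totals[family] += int(count_text)
    return totals


sample = "521x ping, 182x snmpcpu, 686x snmptraffic, 285x wmiservice, 17x wmimemory"
totals = tally_by_family(sample)
for family in ("snmp", "wmi", "other"):
    print(family, totals[family])
```

To process the full line, remove the leading "sensortypes:" label before passing the text to the function.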
This is a good working virtual PRTG configuration, fully on VMware.
I know this goes against some of the recommendations above. In our case it works, but whether it will work in your case, I can't say :)
Please share your comments on and experiences with this post. Or tell us about your configuration and what your recommendations are.
Thanks for your great "real-world" recap of your virtualized environment. I would be interested to learn a few things if you are still actively participating in this forum:
- Has your environment grown? What is your sensor total now?
- Have you run into any performance issues with this many sensors over the past 2+ years from your initial post?
- What speed HDDs do you utilize in the HP SANs? 10K SAS? 15K SCSI? SSD?
We are planning a massive deployment and are interested to hear stories of others' environments that push the limits, both virtualized and on physical hardware.
My recap was written more than 2 years ago. A lot has changed in the last years, in IT and personally. I've changed jobs, so the current status of the monitoring environment is not clear to me. The storage environment has also changed, so I can't give a good answer to your 3 questions.
My point was that if you monitor the health of the probes and create one combined sensor with the health of multiple probes, you know the status. Then you can exceed Paessler's recommendations.
If you want to push the limits of an environment, you could use the Login VSI performance testing tool: http://www.loginvsi.com/ . You can also take a look at XanGati and ControlUp.
May I suggest reconsidering this post? Several of the arguments stated are severely outdated, and it's in your own interest to update this article. This, for example: "Configure your VM to have the CPU cores on the same CPU socket. CPU cores on individual sockets may result in serious performance issues. Scheduling threads over different sockets in such a configuration has a high impact on the operating system so that Windows may not do it at all. This means that PRTG can only use 1 CPU with 1 core in this configuration and cannot work properly."
In the age of NUMA systems and being able to create, or avoid, NUMA-wide VMs, this argument is moot, and it has been for years. I suggest you work with a VMware expert to update this article.
Feedback from development is that the recommendations are still valid, referring to these details from Microsoft about scheduling.
This is a great thread, and many thanks to Paessler support and Peter for sharing specifics. I think it would be helpful for Paessler to bring in an ESXi subject matter expert to consult on this topic and/or go to a customer site with a large ESXi environment and publish a white paper with detailed configuration around optimizing for PRTG. I personally see a massive opportunity that will enable Paessler to move beyond the 5,000 sensor disclaimer around virtualized environments. It's understandable that Paessler cannot provide support to troubleshoot performance issues, but in this day and age, virtualization is a requirement and the documentation around this is key. For me, the value proposition is principally to address hardware failure. I am also excited about the possibilities of the PRTG roadmap as it pertains to performance improvement opportunities. I think a database will be a large part of the solution.
I invite others to post details about their PRTG installation, especially if virtualized or planning to virtualize, scaled beyond 5,000 sensors, and what tuning is in place or planned. We have an older, single physical server running Win 2008 R2, core and probe combined. It has two Dual-Core AMD Opteron(tm) Processor 2218, 2600 MHz, for a total of 4 cores, and way too much RAM, along with SAN-based storage. CPU sits at around 20%. The probe has between 120 and 130 open requests, and it's severely taxed only when adding sensors. There are just under 10,000 sensors, which are almost all ping and SNMP. Ping is every 30 seconds, with 10 packets for improved packet loss resolution and a 5,000 ms timeout to aid with high congestion scenarios. SNMP sensors are every 60 seconds. Sensors are predominantly over the WAN. Adding sensors is almost 100% auto-discovery using a device template. For the virtualized environment, we'll add one or two dedicated probes, with a core with either no sensors or a proportionally decreased number of sensors on the built-in probe. Comments, thoughts?
Thank you for your input. We see the increasing use of virtualization, and use a lot of virtual servers ourselves. Regarding documentation, we do our best to distill our experience and the user experience into recommendations. We want to keep it concise, with a useful outline.
As we license PRTG by the sensor count, we want to support large installations. This implies a lot of larger and smaller changes in the PRTG architecture, in order to remove internal dependencies. It is too early to estimate performance gains. We also want to address the architecture of certain sensors, though again it is too early for details.
We offer a hosted PRTG, running on VMs, making VM performance even more important for us. To do it right, we have to rework many internals.
Just wanted to add my real-world experience with a large PRTG server on a virtual server. We are running VMware ESX 6.5 on Dell hardware. My VM has 12 cores and 48 GB of RAM. I use the API extensively to add and tweak devices and sensors. The API causes very heavy memory use, and it looks like an area that needs some attention.
When the API isn't being used 24GB is more than enough memory but once I start making calls to clone, modify or delete objects, memory usage starts creeping up to where almost all RAM is consumed by the PRTG Core Server. In fact the server will consume all of the RAM and cause a restart of PRTG.
I am running with more than 40,000 sensors ( will probably be over 50,000 before the end of the week ). They are SNMP sensors and I use a separate remote probe ( 4 core, 16GB ).
A remote probe does not provide the scaling I imagined because the bottleneck seems to be disk performance for the database on the core server. The "database" seems to be a folder with lots of sub-folders and files. Once you are running with lots of devices and sensors, the system is opening and closing, reading and writing to files at a very high rate.
I hit a wall with my storage last week and had to remove a newly added set of sensors (three sensors for each of 6,000 devices) in order to stabilize the system. Once I got my Server Team to upgrade me to pure SSD-backed SAN storage (8 Gbps Fibre Channel), the problem cleared. My disk went from 90% or more active time down to just a few percent. Queue length dropped away to nothing.
So my takeaway is that you can make PRTG work in VMware, but you need to be prepared to dedicate serious resources to it. My server team don't want to deploy real servers or local storage on our blades, but they have to be prepared to provide the equivalent within the framework of virtualized servers with remote storage.
I am standing up a new PRTG server that will be of similar size and I am starting with 64GB RAM and making sure I can scale to 128GB if needed. I am hoping that Paessler will look at the API and optimize it in some way. Apart from RAM consumption, web server responsiveness is impacted by API calls. For now, I am going to incorporate a check for RAM exhaustion in my scripts, and pause the script until RAM usage drops back to something reasonable.
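The pause-on-RAM-exhaustion idea could look roughly like the following Python sketch. It is not the poster's actual script. The function name, the 80% limit, and the polling values are assumptions, and how you read the memory usage of the core server (for example, with the third-party psutil package when the script runs on that server) is left to the `get_used_percent` callable.

```python
import time


def wait_for_memory(get_used_percent, limit=80.0, poll_seconds=30.0,
                    max_wait_seconds=3600.0, sleep=time.sleep):
    """Block until memory usage is at or below `limit` percent.

    Returns True once usage is low enough, or False if usage is still
    above the limit after `max_wait_seconds` of waiting.
    """
    waited = 0.0
    while get_used_percent() > limit:
        if waited >= max_wait_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return True


# Simulated readings instead of a real measurement, with sleeping disabled.
readings = iter([95.0, 91.0, 78.0])
ok = wait_for_memory(lambda: next(readings), sleep=lambda seconds: None)
print("safe to continue:", ok)
```

A script would call `wait_for_memory` before each batch of API calls that clone, modify, or delete objects, and stop or alert when it returns False.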
Thank you for providing this practical feedback. It matches our observation that disk performance (not linear read/write speed, but the ability to handle a lot of small operations) is often the bottleneck.