During a project at one of our customers, we ran into very strange performance issues: VMs that had been migrated from legacy, vSphere-based hardware were performing worse on the newer hardware. This article is a must-read for SimpliVity customers running on HPE Gen9 hardware, as you could be impacted as well!
The customer was running their workloads on legacy IBM X3650-M3 server hardware and an IBM Storwize V7000 storage array. This environment was running both desktop and server workloads, using Citrix XenApp 7.8 for desktop purposes and VMware vSphere 5.5 for virtualization and provisioning of virtual machines (including the Citrix virtual machines).
This environment was migrated to fresh SimpliVity clusters based on HPE Gen9 servers with local SSDs for storage, the most recent Intel processors and enough memory to provide the required capacity.
This transformation was expected to deliver much higher performance, but the actual performance was really bad, especially for the desktop workloads (which were based on Citrix technologies). Basically every desktop user was complaining about it.
We noticed that opening Internet Explorer and browsing would basically hang the whole session, and Task Manager in Windows showed 80-100% CPU load until Internet Explorer was closed again.
The desktop image itself was unmodified (other than upgrading VMware Tools). So why was this same image performing so badly on the new hardware?
After I arrived onsite, I started by checking the performance statistics using esxtop on each host. The CPU ready times (%RDY in esxtop) of the Citrix servers were very high, showing values of 8, 15, 22 and even 25. CPU ready should remain below 5, and the closer it is to 0, the better.
As the number of VMs and configured vCPUs was lower than the available physical capacity (no oversubscription), this was not the expected behaviour.
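The two checks above can be expressed as a minimal Python sketch. The VM names, the sample values and the host core count are hypothetical, and the threshold of 5 is simply the rule of thumb from this article. The conversion helper assumes the vCenter real-time chart interval of 20 seconds, for the case where you only have the CPU ready summation in milliseconds rather than the percentage esxtop shows.

```python
from dataclasses import dataclass

# Rule of thumb used in this article: CPU ready should stay below 5.
READY_THRESHOLD = 5.0


@dataclass
class VmSample:
    name: str         # hypothetical VM name
    ready_pct: float  # CPU ready percentage as read from esxtop
    vcpus: int        # number of configured vCPUs


def ready_summation_to_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU ready summation (ms) to a percentage of the sample interval.

    Assumes the vCenter real-time chart interval of 20 seconds by default.
    """
    return ready_ms / (interval_s * 1000.0) * 100.0


def flag_high_ready(samples, threshold: float = READY_THRESHOLD):
    """Return the names of VMs whose CPU ready is at or above the threshold."""
    return [s.name for s in samples if s.ready_pct >= threshold]


def vcpu_to_core_ratio(samples, physical_cores: int) -> float:
    """Ratio of configured vCPUs to physical cores; above 1.0 means oversubscription."""
    return sum(s.vcpus for s in samples) / physical_cores


if __name__ == "__main__":
    samples = [
        VmSample("ctx-01", 8.0, 4),
        VmSample("ctx-02", 22.0, 4),
        VmSample("ctx-03", 2.0, 4),
    ]
    print("High CPU ready:", flag_high_ready(samples))
    print("vCPU:core ratio:", vcpu_to_core_ratio(samples, physical_cores=24))
```

A ratio below 1.0 combined with flagged VMs is exactly the contradiction described here: no oversubscription, yet high CPU ready.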
I checked the configuration of vSphere, reviewed ESXi and Windows logs, compared the software list and searched for any changes in the system32 folder, but no luck there!
One thing that came to mind was the power settings of the system/CPU. I verified this in the ESXi configuration, which showed a static high performance setting. Strange.
After all this checking, one of my colleagues went on to check the iLO (Integrated Lights-Out) settings. Here, the power setting was showing Dynamic Power Savings Mode!
After changing this setting to Static High Performance Mode, performance improved drastically and immediately!
Before SimpliVity started delivering their systems on HPE hardware (which is now the only option because of the acquisition of SimpliVity by HPE), the BIOS settings were basically locked down. A new SimpliVity node was prepared using a so-called Integration Service offered by SimpliVity, which included setting this power setting to high performance or its equivalent, depending on the server vendor.
We discussed this issue with HPE and they confirmed that this setting should have been applied automatically, which had not happened on these systems.
I expect that there are a lot of customers running SimpliVity on HPE hardware who are not getting the performance that the platform can deliver. So make sure you check your systems! I’m not sure whether this only affects the recent Gen9 models, so if you have any feedback, please share it in the comments.
Setting the power settings to Static High Performance Mode is a solution to this issue. However, setting it to OS Control Mode and applying the high performance setting using ESXi settings should also work.
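To check a larger number of hosts against this rule, something like the following Python sketch could help. It assumes you have already collected the iLO power mode and the ESXi power policy per host by hand or with your own tooling; the inventory layout, the host names and the exact label strings are assumptions for illustration, not output of any HPE or VMware API.

```python
# Labels as displayed in iLO and ESXi; adjust if your versions show them differently.
ILO_STATIC_HIGH = "Static High Performance Mode"
ILO_OS_CONTROL = "OS Control Mode"
ESXI_HIGH_PERF = "High Performance"


def power_config_ok(ilo_mode: str, esxi_policy: str) -> bool:
    """Apply the rule from this article to one host.

    OK if iLO is set to static high performance, or if iLO leaves control
    to the OS and ESXi is configured for high performance.
    """
    if ilo_mode == ILO_STATIC_HIGH:
        return True
    return ilo_mode == ILO_OS_CONTROL and esxi_policy == ESXI_HIGH_PERF


def misconfigured_hosts(inventory: dict) -> list:
    """Return a sorted list of host names that fail the power setting rule."""
    return sorted(
        host
        for host, cfg in inventory.items()
        if not power_config_ok(cfg["ilo_mode"], cfg["esxi_policy"])
    )


if __name__ == "__main__":
    # Hypothetical, hand-collected inventory.
    inventory = {
        "esx-01": {"ilo_mode": "Dynamic Power Savings Mode", "esxi_policy": "High Performance"},
        "esx-02": {"ilo_mode": "Static High Performance Mode", "esxi_policy": "Balanced"},
        "esx-03": {"ilo_mode": "OS Control Mode", "esxi_policy": "High Performance"},
    }
    print("Hosts to fix:", misconfigured_hosts(inventory))
```

Note that the first host in the example matches the situation we found: ESXi reports high performance, but iLO is in Dynamic Power Savings Mode, so the host is flagged.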
But wait, there’s more
HPE also mentioned that SimpliVity nodes based on the medium or large form factors should have the vCPU count of their running OVC (OmniStack Virtual Controller) changed from 4 to 6.
Assumptions are the mother of all fuck-ups! I assumed that having the correct power setting in vSphere would also mean a correct setting in the BIOS. Tweaking your BIOS before going into production is one of the best practices and is known by many people.
So, when troubleshooting, check everything and assume nothing.
After finding this solution, we went on to check our other customer environments, where we found more wrongly configured systems. We received very positive feedback on this change.
Even if you’re not running into issues right now, be sure to go and check your settings, as you could see a massive improvement in performance!