Last night I upgraded 16 HP LeftHand P4500 G2 storage nodes from SAN/iQ 9.5 to SAN/iQ *ahum* LeftHand OS 10.5, and I want to share a couple of things I learned from the process.
Before actually upgrading I spent some time analysing the possible risks and impact.
HP states that no downtime whatsoever should occur when you upgrade through the CMC (Centralized Management Console). This is possible because CMC never simultaneously reboots storage nodes that are jointly responsible for a specific LUN (which is, of course, protected by Network RAID-10).
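To make that rule a bit more concrete, here is a minimal sketch of the constraint in Python. I do not know how CMC actually schedules reboots, so treat this purely as an illustration: the function name `plan_reboot_batches`, the node and volume names, and the simplified "which nodes hold a copy of which volume" map are all made up for the example.

```python
from typing import Dict, List, Set


def plan_reboot_batches(replicas: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into reboot batches.

    Two nodes that hold copies of the same volume never end up in the
    same batch, so one copy of every volume stays online at all times.
    This is a simple greedy grouping, NOT the actual CMC algorithm.
    """
    # Every node that holds a copy of at least one volume.
    nodes = sorted({node for holders in replicas.values() for node in holders})
    batches: List[List[str]] = []

    for node in nodes:
        for batch in batches:
            # A conflict exists if some volume is held by this node
            # and by a node that is already in the batch.
            conflict = any(
                node in holders and other in holders
                for holders in replicas.values()
                for other in batch
            )
            if not conflict:
                batch.append(node)
                break
        else:
            # No existing batch is safe, so this node gets its own.
            batches.append([node])
    return batches


# Hypothetical layout: volume name -> nodes holding a copy of it.
example_replicas = {
    "vol1": {"node1", "node2"},
    "vol2": {"node3", "node4"},
    "vol3": {"node2", "node3"},
}
example_plan = plan_reboot_batches(example_replicas)
```

With this made-up layout, nodes that share a volume land in different batches, which is the property that lets a volume stay online while one of its nodes reboots.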
The chance that we would suffer data loss was nil, and other people’s experiences with upgrading the storage nodes in combination with VMware were nothing but positive.
Still, we wanted to take no risk at all and scheduled an extra backup right before upgrading the nodes. The backup ran after regular office hours (6 PM) so that, if disaster struck, as little user data as possible would be lost. Running all seven Veeam backup jobs at the same time took a while (approximately 5 hours), and after that I was good to go.
I started the upgrade process around 11 PM and actively monitored all of our systems. Not a single error or warning came up and no downtime was experienced (except, of course, for the storage nodes themselves while they were rebooting).
The HP FOM (Failover Manager) was upgraded first, followed by the storage nodes. They all power cycled, and some had to restripe before the process continued. After all nodes had been upgraded and rebooted, CMC installed another patch on all systems, which involved yet another power cycle. The whole process took about 5 hours to complete.
I performed a check after the upgrade completed and concluded that only minor issues occurred:
- SQL service on two VMs was stopped, not sure if this is a coincidence or due to the upgrade. Manually started the services OK.
- Disk access lost on some ESXi hosts, but shortly after the access was resumed automatically.
- One VM was marked as ‘inaccessible’. Removed it from inventory and re-added it to solve.
So, no major issues but quite some time to complete.
Oh, and you should increase the Bandwidth Priority of your Management Group inside CMC to raise the speed at which your nodes restripe. I changed this from 16 MB/sec (default) to 40 MB/sec to decrease the total time needed to restripe.
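As a rough back-of-the-envelope illustration of why this setting matters, the small Python sketch below estimates restripe time at both settings. The 500 GB figure is a made-up example (I did not measure how much data each node actually had to restripe), and the calculation assumes the restripe is limited only by the bandwidth setting, which will not be exactly true in practice.

```python
def restripe_hours(data_gb: float, rate_mb_per_s: float) -> float:
    """Estimate restripe duration in hours.

    Assumes the restripe runs at exactly the configured rate and that
    1 GB = 1024 MB. Real-world throughput will differ.
    """
    seconds = data_gb * 1024 / rate_mb_per_s
    return seconds / 3600


# Hypothetical amount of data to restripe; not a measured value.
DATA_GB = 500

default_hours = restripe_hours(DATA_GB, 16)  # default bandwidth priority
raised_hours = restripe_hours(DATA_GB, 40)   # the value I switched to

print(f"16 MB/sec: {default_hours:.1f} h, 40 MB/sec: {raised_hours:.1f} h")
print(f"Speed-up factor: {default_hours / raised_hours:.1f}x")
```

Whatever the real amount of data is, going from 16 to 40 MB/sec gives at best a 2.5x shorter restripe under these assumptions.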
My conclusion is that CMC is a great tool for performing an unattended upgrade of storage clusters. I would trust the tool even without running a backup prior to the process. Still, I would recommend running the upgrade during off-hours because of the path failovers, restriping and possible latency spikes.