VSAN – find the bottleneck! (part 4)

Every solution has a bottleneck.  After setting up a 3-node VSAN cluster with this hardware, I/O performance did not meet my initial expectations, especially write throughput, which came in at just over 200 MB/s:

[Benchmark screenshot: VSAN, default stripe width of 1 (horizon-vsan-stripe1)]
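To get a rough sequential-write number like this from inside a guest VM, here is a minimal sketch in Python. It is not the benchmark tool used for the screenshots in this post. It simply writes a fixed amount of data and times it. The function name and its parameters are hypothetical.

```python
import os
import time


def sequential_write_mb_s(path: str, total_mb: int = 256, block_kb: int = 1024) -> float:
    """Write total_mb of data to `path` in block_kb chunks and return MB/s.

    fsync() runs before the timer stops, so the figure includes flushing the
    guest's page cache to its virtual disk (it cannot see any caching below that).
    """
    block = os.urandom(block_kb * 1024)  # random bytes, so the data is not trivially compressible
    blocks = (total_mb * 1024) // block_kb
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    written_mb = blocks * block_kb / 1024
    return written_mb / elapsed
```

Point `path` at a file on the virtual disk you want to test, because `/tmp` is RAM-backed on some Linux guests. The same random block is written repeatedly, so a storage layer with deduplication could still collapse it. Treat the result as a sanity check, not a replacement for a proper I/O benchmark.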

There was a bottleneck somewhere, and several components could have been the cause, including:

  • VMware storage drivers
    • If the lsi-mr3 driver is replaced by megaraid_perc9, will speed increase?
    • Is the VMware nvme driver an issue?
  • Firmware
    • PERC H330
    • Intel P3700 SSD
    • Samsung 850 EVO
    • PowerEdge R630 backplane
  • VSAN configuration
    • How much does VSAN storage policy determine I/O performance?

To rule out a VSAN issue, I removed one of the hosts from the cluster, re-enabled RAID mode and initialized a new RAID 10 array with the 4 Samsung 850 EVO SSDs.  The results were shockingly bad, under 30 MB/s write:

[Benchmark screenshot: RAID 10 on the PERC H330, lsi-mr3 driver]

Reverting the driver from the VMware-recommended lsi-mr3 for ESXi 6 to the megaraid_perc9 driver for ESXi 5.5 actually doubled the speed to over 50 MB/s, but that still wasn't impressive:

[Benchmark screenshot: RAID 10 on the PERC H330, megaraid_perc9 driver]

At this point, it was pretty clear that the PERC card was the issue.  To rule out the Samsung SSDs, I ran the same test on another standalone server that used the same SSDs but had a PERC H730 instead of the H330.  The results were dramatically better (note that this RAID 10 has 8 SSDs instead of 4):

[Benchmark screenshot: RAID 10 on the PERC H730, 8 Samsung 850 EVO SSDs]

Now that the PERC H330 had been identified as the bottleneck, was there any way to improve write speed in the VSAN cluster without replacing hardware?  The good news is yes: VSAN makes heavy use of the Intel P3700 cache drive to improve I/O.
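Why a fast cache helps, and also why it cannot fully hide a slow controller, can be seen with a back-of-envelope two-tier model. This is a minimal sketch under loud assumptions. It is not how VSAN actually buffers and destages writes. The model assumes each MB is served either by the cache or by the backend, one after the other, with no overlap. The rates in the example loop are hypothetical, not measured values.

```python
def effective_throughput(hit_ratio: float, cache_mb_s: float, backend_mb_s: float) -> float:
    """Toy two-tier model: a fraction hit_ratio of the data is served at cache speed,
    the rest at backend speed, one after the other (no overlap).

    Time per MB is the weighted sum of each tier's time per MB, so the effective rate
    is the weighted harmonic mean of the two rates.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    seconds_per_mb = hit_ratio / cache_mb_s + (1.0 - hit_ratio) / backend_mb_s
    return 1.0 / seconds_per_mb


# Hypothetical rates, chosen only to show the shape of the curve (not measured values).
for h in (0.0, 0.5, 0.9, 0.99):
    print(f"hit ratio {h:.2f}: {effective_throughput(h, 1000.0, 50.0):.0f} MB/s")
```

In this model a 50% hit ratio yields less than 2x the backend rate, because the slow tier dominates total time. Only very high hit ratios approach cache speed. That matches the intuition that caching helps a lot but isn't everything.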

To improve it even further, I created a new storage policy that set the number of disk stripes per object to 4 instead of the default of 1.  My guess was that if each object's data were forced to stripe across all 3 physical hosts in the cluster, guest reads and writes would almost always be handled by all three disk groups.  The results were better than the initial VSAN test:

[Benchmark screenshot: VSAN, stripe width of 4 (horizon-vsan-stripe4)]
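The idea behind raising the stripe width can be illustrated with a toy layout model. This is a minimal sketch that assumes components are handed out to capacity disks in an order that alternates between hosts. VSAN's real placement logic is its own, and this model ignores mirror copies and witness components entirely. The disk names and the 40 GB object size are hypothetical.

```python
def interleaved_disks(hosts: int, disks_per_host: int) -> list[str]:
    """Hypothetical capacity-disk names, ordered so consecutive entries sit on different hosts."""
    return [f"host{h + 1}:disk{d}" for d in range(disks_per_host) for h in range(hosts)]


def stripe_layout(object_gb: float, stripe_width: int, disks: list[str]) -> dict[str, float]:
    """Split ONE copy of an object into stripe_width equal components, one per disk,
    taking disks in list order. Mirroring and witness components are ignored."""
    if not 1 <= stripe_width <= len(disks):
        raise ValueError("stripe_width must be between 1 and the number of capacity disks")
    share = object_gb / stripe_width
    return {disks[i]: share for i in range(stripe_width)}


def hosts_touched(layout: dict[str, float]) -> set[str]:
    """Hosts that hold at least one component of the object."""
    return {name.split(":")[0] for name in layout}


disks = interleaved_disks(hosts=3, disks_per_host=4)  # 3 hosts x 4 capacity SSDs, as in this cluster
for width in (1, 4):
    layout = stripe_layout(40.0, width, disks)  # hypothetical 40 GB virtual disk
    print(width, sorted(hosts_touched(layout)), layout)
```

With a width of 1, the whole copy sits on a single disk on one host. With a width of 4, the four 10 GB components land on all three hosts in this model, which is the effect the stripe-width change was meant to achieve.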

In sum?  VSAN does what it is supposed to do: use cache as much as possible to deliver maximum performance.  That being said, caching isn't everything.  A single standalone host using the H730 with twice as many SSDs, but no high-performance caching SSD, still significantly outperforms the 3-node VSAN cluster on the H330.  Without VSAN, though, anything running on local storage behind a PERC H330 would take a significant performance hit.  With VSAN, these hosts can actually be used for a small VDI environment with standard workloads.  Plotting all 4 results on a chart shows how much of a boost VSAN provides compared to using each host in RAID mode:

[Chart: write throughput of the VSAN and RAID mode tests]
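The size of the boost can be roughed out from the figures quoted in the text. This sketch uses the stated bounds (just over 200, under 30, and over 50 MB/s) as round numbers. The H730 and stripe-width-4 results appear only in the screenshots, so they are left out rather than guessed.

```python
# Rounded write-throughput figures quoted in this post, in MB/s. The H730 and
# stripe-width-4 results appear only in the screenshots, so they are left out.
results_mb_s = {
    "VSAN, stripe width 1": 200.0,    # "just over 200 MB/s"
    "RAID 10, lsi-mr3": 30.0,         # "under 30 MB/s"
    "RAID 10, megaraid_perc9": 50.0,  # "over 50 MB/s"
}


def speedup(results: dict[str, float], candidate: str, baseline: str) -> float:
    """How many times faster `candidate` is than `baseline`."""
    return results[candidate] / results[baseline]


for baseline in ("RAID 10, lsi-mr3", "RAID 10, megaraid_perc9"):
    print(f"VSAN vs {baseline}: {speedup(results_mb_s, 'VSAN, stripe width 1', baseline):.1f}x")
```

This works out to roughly 6.7x over RAID 10 with lsi-mr3 and 4x over RAID 10 with megaraid_perc9. Because 200 MB/s is a floor and 30 MB/s a ceiling, the first ratio is, if anything, an underestimate.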

The PERC H330 just can't handle SSD workloads.  It supports passthrough and HBA mode, which work well with VSAN, but in direct disk tests in RAID mode the results are disappointing.

Bottom line: when SSDs are in use, go with the best PERC card Dell offers.
