Thursday, September 26, 2024

Microsoft Storage Spaces Direct (S2D): Cluster Shared Volume IO Datapath

Introduction

In the previous articles, we explored how Microsoft Storage Spaces Direct (S2D) and StarWind Virtual SAN (VSAN) performed under different configurations in a 2-node Hyper-V cluster setup, focusing on NVMe/TCP in the first article and RDMA-based setups in the second article.

In this article, we shift our focus specifically to the Mirror-Accelerated Parity mode in Microsoft Storage Spaces Direct (S2D). Our goal is to determine the optimal configuration and parameters to achieve maximum performance for this setup. While conducting our benchmarking, we came across useful insights from several sources, including this, this, and this, which helped shape our approach to testing.

One of the critical aspects we noticed was the behavior of file system redirection. While checking the status of one of our volumes, we saw that its “StateInfo” was marked as “FileSystemRedirected,” as depicted in Figure 1 (a PowerShell check is sketched after the figure). This means the I/O is sent over SMB to the coordinator node, which processes it through its own local I/O stack. Put simply, when a non-owner node issues data requests, they are routed over SMB to the owner node for processing.


 

Figure 1
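 
For reference, the same state can be queried directly from PowerShell. A minimal sketch, assuming the CSV behind Volume01 is exposed as the cluster resource “Cluster Virtual Disk (Volume01)” (the resource name will differ per environment):

# Show the CSV I/O state per node; look for "FileSystemRedirected" under StateInfo.
Import-Module FailoverClusters
Get-ClusterSharedVolumeState -Name "Cluster Virtual Disk (Volume01)" |
    Select-Object Name, Node, StateInfo, FileSystemRedirectedIOReason |
    Format-Table -AutoSize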

 

We were curious to see whether this redirection behavior was actually affecting performance, so we set out to confirm it through a series of tests. Let’s dive into what we found!

Testing methodology

To verify the impact of file system redirection behavior, we conducted a series of tests using the 1M read and 1M write patterns across two distinct scenarios:

  1. The test VM is running on the node that owns the volume.
  2. The test VM is running on a node that does not own the volume.

For these tests, our virtual machine (VM) was hosted on node sw-node-01, with its virtual disks placed on Volume01 (see Figure 2).


 

Figure 2

 

  • In the first scenario, sw-node-01 was the owner of Volume01 (Figure 3).


 

Figure 3

 

  • In the second scenario, sw-node-02 was the owner of Volume01 (Figure 4). A sketch of how CSV ownership can be moved between nodes follows the figure.


 

Figure 4
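 
To switch between these two scenarios, CSV ownership can be moved between nodes with the Failover Clustering cmdlets. A minimal sketch, again using the hypothetical resource name “Cluster Virtual Disk (Volume01)”:

# Scenario 1: make sw-node-01 (the node hosting the VM) the volume owner.
Move-ClusterSharedVolume -Name "Cluster Virtual Disk (Volume01)" -Node sw-node-01

# Scenario 2: move ownership to sw-node-02 while the VM stays on sw-node-01.
Move-ClusterSharedVolume -Name "Cluster Virtual Disk (Volume01)" -Node sw-node-02

# Confirm the current owner.
Get-ClusterSharedVolume -Name "Cluster Virtual Disk (Volume01)" | Select-Object Name, OwnerNode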

 

To maintain clarity and control over the variables, we capped test throughput at roughly 1 GiB/s (1 MiB blocks at --rate_iops=1000 in fio) and limited network traffic to a single cluster network. The goal was to observe how I/O behaved under these controlled conditions.

In the VM, we used the following FIO parameters to measure performance for both read and write operations:

  • 1M Read:
fio --name=read --numjobs=1 --iodepth=16 --bs=1024k --rw=read --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/sdb --runtime=60 --time_based=1 --rate_iops=1000
  • 1M Write:
fio --name=write --numjobs=1 --iodepth=16 --bs=1024k --rw=write --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/sdb --runtime=60 --time_based=1 --rate_iops=1000

During the testing, we closely monitored and recorded the values for the following performance counters:

  • Read Bytes/sec and Write Bytes/sec at the Cluster CSVFS level.
  • Disk Read Bytes/sec and Disk Write Bytes/sec on the mirror-accelerated parity virtual disk level.
  • Bytes Total/sec on the cluster network level.

By isolating these performance metrics, we aimed to determine how file system redirection influenced I/O and overall network usage across the two scenarios.
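 
These counters can also be sampled from PowerShell instead of Performance Monitor. A hedged sketch (the exact counter sets and instance names, particularly for the mirror-accelerated parity virtual disk, vary by environment):

# Sample CSVFS, physical disk, and network counters on both nodes, once per second for 60 seconds.
$counters = @(
    "\Cluster CSVFS(*)\Read Bytes/sec",
    "\Cluster CSVFS(*)\Write Bytes/sec",
    "\PhysicalDisk(*)\Disk Read Bytes/sec",
    "\PhysicalDisk(*)\Disk Write Bytes/sec",
    "\Network Interface(*)\Bytes Total/sec"
)
Get-Counter -ComputerName sw-node-01, sw-node-02 -Counter $counters -SampleInterval 1 -MaxSamples 60 |
    ForEach-Object { $_.CounterSamples | Select-Object Path, CookedValue }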

1M Read Test: VM Running on Volume Owner Node

In the 1M read test, where the VM is running on the same node that owns the volume, we get a clear picture of how local read operations behave.


Figure 5

 

The Performance Monitor screenshot in Figure 5 shows the physical disk on sw-node-01 reading data at the full 1 GiB/s test rate, while network utilization is negligible. Bytes Total/sec for the network interfaces stays around 90 KiB/s, which is insignificant compared to the disk read rate and indicates that the network is not involved in this operation.

This confirms that the read operations are handled locally, meaning the data is being directly accessed from the storage on the same node (sw-node-01), without requiring the network to transfer data between nodes. The I/O request does not need to be redirected to another node, which results in lower latency and maximizes the read performance.

1M Read Test: VM Running on Non-Owner Node

In this scenario, we ran a 1M read test with the VM running on a node that doesn’t own the volume. The Performance Monitor data (Figure 6) shows that the actual disk reads are happening on sw-node-02, the node that owns the volume.


 

Figure 6

 

But the catch is that all this data isn’t staying on sw-node-02. Since the VM is on sw-node-01, the data has to travel over the network from sw-node-02 back to sw-node-01. And you can clearly see this in the network traffic: both nodes are pushing around 1 GiB/s over their network interfaces.

In short, when the VM isn’t on the same node as the volume owner, the read performance relies heavily on the network. While the disk does its job, the network can be the bottleneck because it’s handling all the data transfer. This highlights how important it is to keep VMs on the same node as the volume owner if you want to avoid unnecessary network load.
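 
One way to enforce that alignment is to live-migrate the VM’s cluster role to whichever node currently owns the CSV. A minimal sketch, assuming a clustered VM role named “TestVM” and the same hypothetical resource name as above:

# Find the current owner of the CSV and the node hosting the VM role.
$csv = Get-ClusterSharedVolume -Name "Cluster Virtual Disk (Volume01)"
$vm  = Get-ClusterGroup -Name "TestVM"

# If they differ, live-migrate the VM to the volume owner so reads stay local.
if ($vm.OwnerNode.Name -ne $csv.OwnerNode.Name) {
    Move-ClusterVirtualMachineRole -Name $vm.Name -Node $csv.OwnerNode.Name -MigrationType Live
}

Moving the CSV to the VM’s node with Move-ClusterSharedVolume, as sketched earlier, achieves the same alignment from the other direction.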

1M Write Test: VM Running on Volume Owner Node

In this particular test scenario, we are observing the performance of a 1M sequential write operation where the test VM is running on the node that owns the volume.


Figure 7

 

From Figure 7, we can gather a couple of key performance indicators. On sw-node-01, which is the owner of the volume, the physical disk is handling writes at 1 GiB/s.

But the real action is happening on the network interface. The Mellanox ConnectX-6 adapter shows a whopping 2 GiB/s of traffic. Why so much? Well, that’s because the data being written on sw-node-01 isn’t just staying there – mirrored data is replicated across the network to sw-node-02. The same amount of traffic shows up on sw-node-02’s network adapter, confirming that all this data is being synced between the two nodes in real time to maintain redundancy.

1M Write Test: VM Running on Non-Owner Node

So, in this 1M write test, we’re dealing with a VM that’s running on sw-node-01, which is not the owner of the volume. This setup is interesting because it adds an extra layer of complexity.


 

Figure 8

 

Moving on to Figure 8 from the performance monitor, we see that on sw-node-02, which actually owns the volume, the physical disk is handling writes at 1 GiB/s. So, no surprises there.

Now, look at the network utilization on both nodes. The Mellanox ConnectX-6 adapters on sw-node-01 and sw-node-02 are showing traffic of around 3 GiB/s each. Why is that? Well, since sw-node-01 doesn’t own the volume, all the data being written has to go over the network to sw-node-02 first. Once the write lands on sw-node-02, the mirrored copy is replicated back over the network to sw-node-01 for redundancy.

Therefore, when the VM isn’t running on the volume-owning node, writes land on the owner node’s disks at 1 GiB/s while around 3 GiB/s crosses the network to get the data to the owner and keep it mirrored.

Conclusion

Our test results confirm a key behavior: when a VM runs on a node that doesn’t own the volume, I/O requests are redirected over the network to the volume’s owner node. This redirection creates a longer I/O data path and adds network overhead, which can impact performance, especially under high load. On the other hand, when the VM runs on the node that owns the volume, data is read from the virtual disk locally, bypassing the network stack entirely.

Write operations also generate less network traffic in that case, since the data doesn’t need to be redirected to the volume owner node before being written. Instead, it’s written locally and only the mirror traffic crosses the network to the other node. This significantly reduces network load and improves overall performance.

To maximize efficiency, it’s optimal to run VMs on the node that owns the volume. This was the approach we followed in our series of benchmarking articles comparing StarWind Virtual SAN (VSAN) and Microsoft Storage Spaces Direct (S2D) in a 2-node Hyper-V cluster setup. In case you missed them, our previous articles explored the performance of these solutions under NVMe-oF over TCP and RDMA-based setups.

It’s important to consider this behavior in your production environment. If network bandwidth becomes a bottleneck, especially when VMs are running on non-owner nodes, it can lead to performance issues. Proper VM placement on the volume owner node can mitigate these potential problems and ensure smoother operation.

Stay tuned for more insights in our upcoming articles, where we’ll continue exploring the nuances of other solutions and provide deeper analysis to help you refine your IT strategy.



from StarWind Blog https://ift.tt/gp45m6n
via IFTTT
