Tuesday, December 31, 2013

Project Karcygwins and Virtualised Storage Performance


 
 

Introduction

Over the last few years we have witnessed a revolution in storage solutions. Devices capable of achieving millions of Input/Output Operations per Second (IOPS) are now available off the shelf. At the same time, Central Processing Unit (CPU) speeds remain largely constant. This means that the overhead of processing storage requests now measurably affects the delivered throughput. In a world of virtualisation, where extra processing is required to securely pass requests from virtual machines (VMs) to storage domains, this overhead becomes even more evident.

This is the first time that such an overhead has become a concern. Until recently, the time spent within I/O devices was much longer than the time spent processing a request on the CPU. Kernel and driver developers were mainly worried about: (1) not blocking while waiting for devices to complete; and (2) sending requests optimised for specific device types. While the former was addressed by techniques such as Direct Memory Access (DMA), the latter was solved by elevator algorithms such as Completely Fair Queueing (CFQ).

Today, with the wide adoption of Solid-State Drives (SSDs) and the further development of low-latency storage solutions such as those built on top of PCI Express (PCIe) and Non-Volatile Memory (NVM) technologies, the main concern is to avoid wasting time in processing requests. Within the Xen Project community, development has already started to allow storage traffic to scale across several VMs. Linux kernel maintainers and storage manufacturers are also working on similar issues. In the meantime, XenServer Engineering delivered Project Karcygwins, which provided a better understanding of the current bottlenecks: when they become evident and what can be done to overcome them.

Project Karcygwins

Karcygwins was originally intended as three separate projects (Karthes, Cygni and Twins). Because their topics were closely related, they were merged. The three projects were proposed around subjects believed to be affecting virtualised storage throughput.

Project Karthes aimed at assessing and mitigating the cost of mapping (and unmapping) memory between domains. When a VM issues an I/O request, the storage driver domain (dom0 in XenServer) requires access to certain memory areas in the guest domain. After the request is served, these areas need to be released (or unmapped). This is also an expensive operation due to the flushes required in different caches (notably the TLB). Karthes was proposed to investigate the cost of these operations, how they impact the delivered throughput and what can be done to mitigate them.
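
To make that cost more concrete, the sketch below shows roughly what a backend such as blkback has to do for every 4 KiB segment of a request. It is a simplified illustration based on Xen's public grant-table interface; the real code batches these hypercalls, handles errors and needs the appropriate Xen headers (e.g. xen/interface/grant_table.h), none of which is shown here.

    /* Simplified per-segment map/unmap cycle in a storage backend.
     * Illustrative only; batching and error handling are omitted. */
    static void serve_one_segment(domid_t guest, grant_ref_t gref,
                                  uint64_t host_addr)
    {
        struct gnttab_map_grant_ref map = {
            .host_addr = host_addr,        /* where dom0 wants the guest page */
            .flags     = GNTMAP_host_map,  /* map it into dom0's address space */
            .ref       = gref,             /* grant reference issued by the guest */
            .dom       = guest,
        };
        HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &map, 1);

        /* ... submit the actual I/O against host_addr ... */

        struct gnttab_unmap_grant_ref unmap = {
            .host_addr = host_addr,
            .handle    = map.handle,       /* returned by the map operation */
        };
        /* The unmap is the expensive part: it forces TLB flushes in dom0. */
        HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref, &unmap, 1);
    }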

Project Cygni aimed at allowing requests larger than 44 KiB to be passed between a guest and a storage driver domain. Until recently, Xen's blkif protocol defined a fixed array of data segments per request. This array had room for 11 segments, each corresponding to a 4 KiB memory page (hence the 44 KiB limit). The protocol has since been updated to support indirect I/O operations, where the request's segments point to pages that themselves contain further segments. This change allows for much larger requests at a small extra cost.
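
The sketch below illustrates the difference between the two request formats. The field names and array sizes are simplified for illustration; the authoritative layout lives in Xen's public io/blkif.h header.

    #include <stdint.h>

    /* Illustrative only -- not the real blkif structures. */

    struct segment {                      /* describes one 4 KiB guest page */
        uint32_t gref;                    /* grant reference for that page */
        uint8_t  first_sect, last_sect;   /* sub-page range in 512-byte sectors */
    };

    /* Classic request: the segment array is embedded in the request itself,
     * so a request can never describe more than 11 x 4 KiB = 44 KiB. */
    struct direct_request {
        struct segment seg[11];
    };

    /* Indirect request: the request only carries grant references to extra
     * pages, and those pages are filled with struct segment entries.  One
     * request can therefore describe megabytes of data for the cost of
     * mapping a few additional pages. */
    struct indirect_request {
        uint16_t nr_segments;             /* total number of data segments */
        uint32_t indirect_grefs[8];       /* pages containing segment entries */
    };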

Project Twins aimed at evaluating the benefits of using two communication rings between dom0 and a VM. Currently, only one ring exists and it is used both for requests from the guest and for responses from the back end. With two rings, requests and responses would each have their own ring. This strategy allows for more in-flight data and better use of caching.
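
A conceptual sketch of the two layouts is shown below. The types are hypothetical stand-ins, not the real ring definitions from Xen's io/ring.h.

    #include <stdint.h>

    struct request  { uint64_t id; /* operation, segments, ... */ };
    struct response { uint64_t id; int16_t status; };

    /* Today: a single shared ring in which each slot is used first for a
     * request and later reused for its response, so requests and responses
     * compete for the same slots and cache lines. */
    struct shared_ring {
        union { struct request req; struct response rsp; } slot[32];
    };

    /* Project Twins: one ring per direction.  Requests and responses no
     * longer overwrite each other, allowing more in-flight data and a more
     * cache-friendly access pattern on each side. */
    struct split_rings {
        struct request  req[32];          /* guest -> storage driver domain */
        struct response rsp[32];          /* storage driver domain -> guest */
    };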

Due to initial findings, the main focus of Karcygwins remained on Project Karthes. The code allowing requests larger than 44 KiB, however, was included in all measurements, addressing the goals proposed for Project Cygni. The idea of using split rings (Project Twins) was postponed and will be investigated at a later stage.

Visualising the Overhead

When a user installs a virtualisation platform, one of the first questions to be raised is: "what is the performance overhead?". When it comes to storage performance, a straightforward way to quantify this overhead is to measure I/O throughput on a bare-metal Linux installation and to repeat the measurement (on the same hardware) from a Linux VM. This can readily be done with a generic tool like dd for a variety of block sizes. It is a simple test that does not cover concurrent workloads or greater I/O depths.
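
A minimal sketch of this kind of measurement, written in C rather than driven by dd, might look like the following: single-threaded sequential reads with O_DIRECT (so the page cache does not inflate the numbers) at a given block size. The device path and sizes are placeholders, not the ones used in the tests.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/xvdb";      /* placeholder */
        size_t bs = argc > 2 ? atoi(argv[2]) * 1024 : 16 * 1024; /* KiB -> bytes */
        size_t total = 1024ULL * 1024 * 1024;                    /* read 1 GiB */

        int fd = open(dev, O_RDONLY | O_DIRECT);                 /* bypass page cache */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, bs)) return 1;            /* O_DIRECT alignment */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t done = 0;
        while (done < total) {
            ssize_t n = read(fd, buf, bs);                       /* one request at a time */
            if (n <= 0) break;
            done += n;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%zu KiB requests: %.1f MB/s\n", bs / 1024, done / secs / 1e6);
        close(fd);
        return 0;
    }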

[Figure: karcyg-fig0.png]

Looking at the plot above we can see that, on a 7.2k RPM SATA Western Digital Blue WD5000AAKX, read requests of 16 KiB are already enough to reach the maximum disk throughput of just over 120 MB/s (red line). When repeating the same test from a VM (green and blue lines), however, the throughput for small requests is much lower. It eventually reaches the same 120 MB/s mark, but only with larger requests.

The green line represents the data path where blkback is directly plugged to the back end storage. blkback is the kernel module that receives requests from the VM. While this is the fastest virtualisation path in the Xen world, it lacks certain software-level features such as thin provisioning, cloning, snapshotting and the ability to migrate guests without centralised storage.

The blue line represents the data path where requests go through tapdisk2. This is a user space application that runs in dom0 and implements the VHD format. It also has an NBD plugin for migrating guests without centralised storage, and it allows for thin provisioning, cloning and snapshotting of Virtual Disk Images (VDIs). Because requests traverse more components before reaching the disk, this path is understandably slower.

Using Solid-State Drives and Fast RAID Arrays

The shape of the plot above is not the same for all types of disks, though. Modern disk setups can achieve considerably higher data rates before their throughput flattens out.

[Figure: karcyg-fig1.png]

Looking at the plot above, we can see a similar test executed from dom0 on two different back end types. The red line represents the throughput obtained from a RAID0 formed by two SSDs (Intel DC S3700). The blue line represents the throughput obtained from a RAID0 formed by two SAS disks (Seagate ST). Both arrays were measured independently and are connected to the host through a PERC H700 controller. While the Seagate SAS array achieves its maximum throughput at around 370 MB/s when using 48 KiB requests, the Intel SSD array continues to speed up even with requests as large as 4 MiB. Focusing on each array separately, it is possible to compare these dom0 measurements with measurements obtained from a VM. The plot below isolates the Seagate SAS array.

[Figure: karcyg-fig2.png]

Similar to what is observed in the measurements taken on the single Western Digital disk, the throughput measured from a VM is lower than that of dom0 when requests are not big enough. In this case, the blkback data path (the pink line) allows the VM to reach the same throughput offered by the array (370 MB/s) with requests larger than 116 KiB. The other data paths (orange, cyan and brown lines) represent user space alternatives that hit different bottlenecks and, even with large requests, cannot match the throughput measured from dom0.

It is interesting to observe that the user space implementations vary considerably in terms of performance. When using qdisk as the back end together with the blkfront driver from Linux kernel 3.11.0 (the orange line), the throughput is higher for request sizes such as 256 KiB (compared to the other user space alternatives -- the blkback data path remains faster). The main difference in this particular setup is the support for persistent grants. This technique, implemented in 3.11.0, reuses memory grants and drastically reduces the number of map and unmap operations. It requires, however, an additional copy operation within the guest. The trade-off may have different implications depending on factors such as hardware architecture and workload type. More on that in the next section.
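
To see where the trade-off comes from, the sketch below contrasts the two front-end strategies in very rough terms. All names and stubs are hypothetical stand-ins for the real blkfront/blkback machinery; the point is only which cost is paid on each path.

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    typedef uint32_t grant_ref_t;

    struct io     { void *page; };                  /* one page of I/O data */
    struct pgrant { grant_ref_t ref; void *page; }; /* persistently granted page */

    /* Hypothetical stubs standing in for the real front-end machinery. */
    grant_ref_t grant_page(void *page);             /* granted, mapped and unmapped per request */
    struct pgrant *get_pooled_grant(void);          /* granted once, kept mapped by the backend */
    void send_request(grant_ref_t ref, struct io *io);

    /* Without persistent grants: each request grants its data pages, which the
     * backend must map and later unmap again (TLB flushes in dom0). */
    void submit_classic(struct io *io)
    {
        send_request(grant_page(io->page), io);
    }

    /* With persistent grants (as in the 3.11.0 blkfront): the grant is reused,
     * so the per-request map/unmap cost disappears, but the guest pays an extra
     * copy into (or, for reads, out of) the persistently granted page. */
    void submit_persistent(struct io *io)
    {
        struct pgrant *pg = get_pooled_grant();
        memcpy(pg->page, io->page, PAGE_SIZE);      /* the extra copy */
        send_request(pg->ref, io);
    }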

[Figure: karcyg-fig3.png]

When repeating these measurements on the Intel SSD array, a new issue came to light. Because the array delivers higher throughput with no sign of abating as larger requests are issued, none of the virtualisation technologies is capable of matching the throughput measured from dom0. While this behaviour will probably differ with other workloads, this is what has been observed when using a single I/O thread with the queue depth set to one. In a nutshell, 2 MiB read requests from dom0 achieve 900 MB/s of throughput, while the same measurement from a VM reaches only 300 MB/s when using user space back ends. This is a pathological example, chosen for this particular hardware architecture to show how bad things can get.

Understanding the Overhead

In order to understand why the overhead is so evident in some cases, it is necessary to take a step back. The measurements taken on slower disks show that all virtualisation technologies are somewhat slower than what is observed in dom0. On such disks, this difference disappears as requests grow in size. What happens at that point is that the disk itself becomes "maxed out" and cannot respond faster no matter the request size. At the same time, much of the work done at the virtualisation layers does not get slower in proportion to the amount of data associated with a request. For example, notifications between domains are unlikely to take longer simply because requests are bigger. This is exactly why there is no visible overhead with large enough requests on certain disks.
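
A toy model helps to make this concrete. In the sketch below the service time of a request is split into a fixed per-request cost, a per-page cost (standing in for work, such as grant handling, that scales with the number of 4 KiB pages) and the raw transfer time. All constants are invented for illustration, not measured values; the model simply shows how the fixed cost is amortised quickly on a slow device while per-page work keeps a fast device well below its raw rate.

    #include <stdio.h>

    int main(void)
    {
        /* Invented, illustrative constants -- not measurements. */
        double fixed_s    = 30e-6;              /* per-request cost (notifications, etc.) */
        double per_page_s = 3e-6;               /* per-4KiB-page cost (grant work) */
        double rates[]    = { 120e6, 900e6 };   /* slow disk vs fast array, in B/s */

        int sizes_kib[] = { 4, 64, 512, 2048 };

        for (int r = 0; r < 2; r++) {
            printf("raw device rate %.0f MB/s:\n", rates[r] / 1e6);
            for (int i = 0; i < 4; i++) {
                double bs    = sizes_kib[i] * 1024.0;
                double pages = bs / 4096.0;
                double t     = fixed_s + pages * per_page_s + bs / rates[r];
                printf("  %4d KiB requests -> %6.1f MB/s\n",
                       sizes_kib[i], bs / t / 1e6);
            }
        }
        return 0;
    }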

However, the question remains: what is consuming CPU time and causing such a visible overhead in the example presented earlier? There are mainly two techniques that can be used to answer that question: profiling and tracing. Profiling collects instruction pointer samples once every so many events; analysing millions of such samples reveals the hot code paths where time is being spent. Tracing, on the other hand, measures the exact time elapsed between two events.

For this particular analysis, the tracing technique and the blkback data path were chosen. To measure the time spent between events, the code was modified and several RDTSC instructions were inserted. These instructions read the Time Stamp Counter (TSC); they are relatively cheap to execute while providing very accurate data. On modern hardware, TSCs are constant and consistent across the cores of a host. This means that measurements from different domains (i.e. dom0 and guests) can be matched to obtain the time elapsed, for example, between blkfront kicking blkback and blkback picking up the request. The diagram below shows where trace points have been inserted.

[Figure: karcyg-fig4.png]
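
As an illustration of the technique (not the actual patch), a trace point of this kind boils down to reading the TSC and recording it, so that cycle deltas between any two points can be computed afterwards:

    #include <stdint.h>

    static inline uint64_t read_tsc(void)
    {
        uint32_t lo, hi;
        /* RDTSC returns the 64-bit time stamp counter in EDX:EAX. */
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    #define MAX_TRACE 16
    static uint64_t trace[MAX_TRACE];

    static void trace_point(int id)
    {
        trace[id] = read_tsc();        /* one slot per point in the I/O path */
    }

    /* Later, the cycles between two points are simply trace[b] - trace[a].
     * With constant, invariant TSCs these values are comparable across
     * cores and therefore across domains on the same host. */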

In order to gather meaningful results, 100 requests were issued in succession. Domains were pinned to the same NUMA node on the host and turbo capabilities were disabled. The TSC readings were collected for each request and analysed both individually and as an average. The individual analysis revealed interesting findings, such as a "warm up" period where the first requests are always slower; this was attributed to caching and scheduling effects. It also showed that some requests were randomly faster than others in certain parts of the path, which was attributed to CPU affinity. For the average analysis, the 20 fastest and slowest requests were discarded, which produced more stable and reproducible results. The plots below show these results.

[Figure: karcyg-fig5.png]

[Figure: karcyg-fig6.png]

Without persistent grants, the cost of mapping and unmapping memory across domains is clearly a significant factor as requests grow in size. With persistent grants, the extra copy on the front end adds up and results in a slower overall path. Roger Pau Monne, however, showed that persistent grants can improve aggregate throughput from multiple VMs, as they reduce contention on the grant tables. Matt Wilson, following discussions at the Xen Developer Summit 2013, produced patches that should also help with grant table contention.

Conclusions and Next Steps

In summary, Project Karcygwins provided an understanding of several key elements of storage performance for both Xen and XenServer:

  • The time spent processing requests (on the CPU) definitely matters as disks get faster
  • Throughput is visibly affected for single-threaded I/O on low-latency storage
  • Kernel-only data paths can be significantly faster
  • The cost of mapping (and unmapping) grants is the most significant bottleneck at this time

It also brought these issues to the attention of the Linux and Xen Project communities, with the results shared through a series of presentations and discussions.

Next, new research projects are scheduled (or already underway) to:

  • Look into new ideas for low-latency virtualised storage
  • Investigate bottlenecks and alternatives for aggregate workloads
  • Reduce the overall CPU utilisation of processing requests in user space

Have a happy 2014 and thanks for reading!





